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Abstract. Low-rank matrix approximations, such as the truncated singular value decomposition 
and the rank-revealing QR decomposition, play a central role in data analysis and scientific com- 
puting. This work surveys recent research which demonstrates that randomization offers a powerful 
tool for performing low-rank matrix approximation. These techniques exploit modern computational 
architectures more fully than classical methods and open the possibility of dealing with truly massive 
data sets. In particular, these techniques offer a route toward principal component analysis (PCA) 
for petascale data. 

This paper presents a modular framework for constructing randomized algorithms that compute 
partial matrix decompositions. These methods use random sampling to identify a subspace that 
captures most of the action of a matrix. The input matrix is then compressed — either explicitly or 
implicitly — to this subspace, and the reduced matrix is manipulated deterministically to obtain the 
desired low-rank factorization. In many cases, this approach beats its classical competitors in terms 
of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments 
and a detailed error analysis. 

The specific benefits of randomized techniques depend on the computational environment. Con- 
sider the model problem of finding the k dominant components of the singular value decomposition 
of an m X n matrix, (i) For a dense input matrix, randomized algorithms require O(r?inlog(fc)) 
floating-point operations (flops) in contrast with O(mnfc) for classical algorithms, (ii) For a sparse 
input matrix, the flop count matches classical Krylov subspace methods, but the randomized ap- 
proach is more robust and can be reorganized to exploit multi-processor architectures, (iii) For a 
matrix that is too large to fit in slow memory, the randomized techniques require only a constant 
number of passes over the data, as opposed to 0(fc) passes for classical algorithms. In fact, it is 
sometimes possible to perform matrix approximation with a single pass over the data. 

Key words. Dimension reduction, eigenvalue decomposition, intcrpolativc decomposition, 
Johnson-Lindenstrauss lemma, matrix approximation, parallel algorithm, pass-efficient algorithm, 
principal component analysis, randomized algorithm, random matrix, rank-revealing QR factoriza- 
tion, singular value decomposition, streaming algorithm. 

AMS subject classifications. [MSC2010] Primary: 65F30. Secondary: 68W20, 60B20. 

Part I: Introduction 

1. Overview. On a well-known list of the "Top 10 Algorithms" that have in- 
fluenced the practice of science and engineering during the 20th century [42] , we find 
an entry that is not really an algorithm: the idea of using matrix factorizations to 
accomplish basic tasks in numerical linear algebra. In the accompanying article [126], 
Stewart explains that 

The underlying principle of the decompositional approach to matrix 
computation is that it is not the business of the matrix algorith- 
micists to solve particular problems but to construct computational 
platforms from which a variety of problems can be solved. 

Stewart goes on to argue that this point of view has had many fruitful consequences, 
including the development of robust software for performing these factorizations in a 
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highly accurate and provably correct manner. 

The deconipositional approach to matrix computation remains fundamental, but 
we believe the time has come to consider alternative techniques for computing these 
decompositions. Several reasons motivate this claim: 

• A salient feature of modern applications, especially in data mining, is that 
the matrices are stupendously big. Classical algorithms are not always well 
adapted to solving the type of large-scale problems that now arise. 

• In the information sciences, it is common that data are missing or inaccurate. 
Classical algorithms are designed to produce highly accurate matrix decompo- 
sitions, but it seems profligate to spend extra computational resources when 
the imprecision of the data inherently limits the resolution of the output. 

• Data transfer now plays a major role in the computational cost of numerical 
algorithms. Techniques that require few passes over the data may be substan- 
tially faster in practice, even if they require as many — or more — floating-point 
operations. 

• As the structure of computer processors continues to evolve, it becomes in- 
creasingly important for numerical algorithms to adapt to a range of novel 
architectures, such as graphics processing units. 

The purpose of this paper is to survey and extend recent work, including [17,46, 
58,94,105,111,112,118,135], which has demonstrated that randomized algorithms 
provide a powerful tool for constructing approximate matrix factorizations. These 
techniques are simple and effective, sometimes shockingly so. Compared with stan- 
dard deterministic algorithms, the randomized methods are often faster and — perhaps 
surprisingly — more robust. Furthermore, they can produce factorizations that are ac- 
curate to any specifled tolerance above machine precision, which allows the user to 
trade accuracy for speed. It is even possible to construct factorizations more accurate 
than those produced by classical methods. We present numerical evidence that these 
algorithms succeed for real computational problems. 

The paper surveys randomized techniques for computing low-rank approxima- 
tions. It is designed to help practitioners identify situations where randomized tech- 
niques may outperform established methods. Many of the basic ideas in this treatment 
are drawn from the literature, but we have assembled the components in our own fash- 
ion. This approach clarifies how random techniques interact with classical techniques 
to yield effective, modern algorithms supported by detailed theoretical guarantees. 

Remark 1.1. Our experience suggests that many practitioners of scientific com- 
puting view randomized algorithms as a desperate and final resort. Let us address 
this concern immediately. Classical Monte Carlo methods are highly sensitive to the 
random number generator and typically produce output with low and uncertain ac- 
curacy. In contrast, the algorithms discussed herein are insensitive to the quality 
of randomness and produce highly accurate results. The probability of failure is a 
user-specified parameter that can be rendered negligible (say, less than 10~^'^) with a 
nominal impact on the computational resources required. 

1.1. Approximation by low-rank matrices. The roster of standard matrix 
decompositions includes the pivoted QR factorization, the eigenvalue decomposition, 
and the singular value decomposition (SVD), all of which expose the (numerical) range 
of a matrix. Truncated versions of these factorizations are often used to express a 
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low-rank approximation of a given matrix: 



A 



B C, 

m X k k X n. 



(1.1) 



m X n 



The inner dimension k is sometimes called the numerical rank of the matrix. When 
the numerical rank is much smaller than either dimension m or n, a factorization such 
as (1.1) allows the matrix to be stored inexpensively and to be multiplied rapidly with 
vectors or other matrices. The factorizations can also be used for data interpretation 
or to solve computational problems, such as least squares. 

Matrices with low numerical rank appear in a wide variety of scientific applica- 
tions. We list only a few: 

• A basic method in statistics and data mining is to compute the directions of 
maximal variance in vector-valued data by performing principal component 
analysis (PCA) on data matrix. PCA is nothing other than a low-rank matrix 
approximation [70, §14.5]. 

• Another standard technique in data analysis is to perform low-dimensional 
embedding of data under the assumption that there are fewer degrees of 
freedom than the ambient dimension would suggest. In many cases, the 
method reduces to computing a partial SVD of a matrix derived from the 
data. See [70, §§14.8-14.9] or [33]. 

• The problem of estimating parameters from measured data via least-squares 
fitting often leads to very large systems of linear equations that arc close to 
linearly dependent. Effective techniques for factoring the coefficient matrix 
lead to efficient techniques for solving the least-squares problem, [112]. 

• Many fast numerical algorithms for solving PDEs and for rapidly evaluating 
potential fields such as the fast multipole method [66] and H- matrices [65], 
rely on low-rank approximations of continuum operators with exponentially 
decaying spectra. 

• Models of multiscale physical phenomena often involve PDEs with rapidly 
oscillating coefficients. Techniques for model reduction or coarse graining in 
such environments are often based on the observation that the linear trans- 
form that maps the input data to the requested output data can be approxi- 
mated by an operator of low rank [55] . 

1.2. Matrix approximation framework. The task of computing a low-rank 
approximation to a given matrix can be split naturally into two computational stages. 
The first is to construct a low-dimensional subspace that captures the action of the 
matrix. The second is to restrict the matrix to the subspace and then compute a 
standard factorization (QR, SVD, etc.) of the reduced matrix. To be slightly more 
formal, we subdivide the computation as follows. 

Stage A. Compute an approximate basis for the range of the input matrix A. In 
other words, we require a matrix Q for which 



We would like the basis matrix Q to contain as few columns as possible, but 
it is not critical to obtain the absolute minimum number at this stage as long 
as the approximation of the input matrix is accurate. 



Q has orthonormal columns and A w QQ*A. 



(1.2) 
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Stage B. Given a matrix Q that satisfies (1.2), we use Q to help compute a standard 
factorization (QR, SVD, etc.) of A. 

The task in Stage A can be executed very efficiently with random sampling meth- 
ods, and it is the primary subject of this work. In the next subsection, we offer an 
overview of these ideas. The body of the paper provides details of the algorithms (§4) 
and a theoretical analysis of their performance (§§8-11). 

Stage B can be completed with well-established deterministic methods. Sec- 
tion 3.3.3 contains an introduction to these techniques, and §5 shows how we apply 
them to produce low-rank factorizations. 

At this stage in the development, it may not be clear why the output from Stage A 
facilitates our job in Stage B. Let us illustrate by describing how to obtain an ap- 
proximate SVD of A given a matrix Q that satisfies (1.2). More precisely, wc wish to 
compute matrices U and V with orthonormal columns and a nonnegative, diagonal 
matrix S such that A w VSIV*. This goal is achieved after three simple steps: 

1. Form B ^ Q*A, which yields the low-rank factorization A « QB. 

2. Compute an SVD of the small matrix: B = VEV*. 

3. Set U = QU. 

When Q has few columns, this procedure is efficient because we can construct 
B easily and compute its SVD rapidly. In practice, we can often avoid forming B 
explicitly by means of subtler techniques. In some cases, it is not even necessary 
to revisit the input matrix A during Stage B. This observation allows us to develop 
single-pass algorithms^ which look at each entry of A only once. 

Similar manipulations readily yield other standard factorizations, such as the 
pivoted QR factorization, the eigenvalue decomposition, etc. 

1.3. Randomized algorithms. This paper describes a class of randomized al- 
gorithms for completing Stage A of the matrix approximation framework set forth 
in §1.2. We begin with more details about the approximation problem these algo- 
rithms target (§1.3.1). Afterward, we motivate the random sampling technique with 
a heuristic explanation (§1.3.2) that leads to a prototype algorithm (§1.3.3). 

1.3.1. Problem formulations. The basic challenge in producing low-rank ma- 
trix approximations is a primitive question that we call the fixed-precision approxi- 
mation problem. Suppose we are given a matrix A and a positive error tolerance e. 
We seek a matrix Q with k = k{e) orthonormal columns such that 

i|A-QQ*A|| < £, (1.3) 

where ||-|| denotes the £2 operator norm. The range of Q is a fc-dimensional subspace 
that captures most of the action of A, and we would like k to be as small as possible. 

The singular value decomposition furnishes an optimal answer to the fixed-precision 
problem [99]. Let aj denote the jth largest singular value of A. For each j > 0, 

min ||A-B||=a,+i. (1.4) 

rank(i3)<j 

One way to construct a minimizcr is to choose B ~ QQ*A, where the columns of Q 
are k dominant left singular vectors of A. Consequently, the minimal rank k where 
(1.3) holds equals the number of singular values of A that exceed the tolerance e. 

To simplify the development of algorithms, it is convenient to assume that the 
desired rank k is specified in advance. We call the resulting problem the fixed-rank 
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approximation problem. Given a matrix A, a target rank k, and an oversampling 
parameter p, we seek to construct a matrix Q with k + p orthonormal columns such 
that 

\\A-QQ*A\\f=i min (1.5) 

rank(B)<fc 

Although there exists a mininiizer Q that solves the fixed rank problem for p = 0, the 
opportunity to use a small number of additional columns provides a flexibility that is 
crucial for the effectiveness of the computational methods we discuss. 

We will demonstrate that algorithms for the fixed-rank problem can be adapted 
to solve the fixed-precision problem. The connection is based on the observation that 
we can build the basis matrix Q incrementally and, at any point in the computation, 
we can inexpensively estimate the residual error \\A — QQ*A\\. Refer to §4.4 for the 
details of this reduction. 

1.3.2. Intuition. To understand how randomness helps us solve the fixed-rank 
problem, it is helpful to consider some motivating examples. 

First, suppose that we seek a basis for the range of a matrix A with exact rank 
k. Draw a random vector u), and form the product y = Auj. For now, the precise 
distribution of the random vector is unimportant; just think of y as a random sample 
from the range of A. Let us repeat this sampling process k times: 

yW=Aa;(*), i = l,2,...,fc. 

Owing to the randomness, the set {w^'^} of random vectors is likely to be in general 
linear position. In particular, the random vectors form a linearly independent set and 
no linear combination falls in the null space of A. As a result, the set {y'-'-'j^Li of 
sample vectors is also linearly independent, so it spans the range of A. Therefore, to 
produce an orthonormal basis for the range of A, we just need to orthonormalize the 
sample vectors. 

Now, imagine that A = B + E where B is a rank-fc matrix and E is a small 
perturbation. If we generate exactly k samples from the range of A, it is unlikely 
that the linear span of these vectors aligns closely with the range of B owing to the 
perturbation E. As a result, we might miss directions that contribute significantly to 
the action of A. To improve the situation, we fix a small integer p representing some 
degree of oversampling, and we generate k + p samples 

y(')==Aa;('), i^l,2,...,k+p. 

Remarkably, a very small amount of oversampling is sufficient for the span of the 
samples to capture the range of B to high accuracy with negligible failure probability. 
In fact, p = 5 or p = 10 often gives superb results. 

1.3.3. A prototype algorithm. The intuitive approach of §1.3.2 can be ap- 
plied to general matrices. Omitting computational details for now, we formalize the 
procedure in the figure labeled Proto- Algorithm. 

This simple algorithm is by no means new. It is essentially the first step of 
an orthogonal iteration with a random initial subspace [61, §7.3.2]. The novelty 
comes from the additional observation that the initial subspace should have a slightly 
higher dimension than the invariant subspace we are trying to approximate. With 
this revision, it is often the case that no further iteration is required to obtain a 
high-quality solution to (1.5). We believe this idea can be traced to [94, 105]. 
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Proto- Algorithm: Solving the Fixed-Rank Problem 

Given an m x n matrix A, a target rank k, and an oversampling parameter 
p, this procedure computes an m x (k + p) matrix Q whose columns are 
orthonormal and whose range approximates the range of A. 

1 Draw a random n x [k + p) test matrix fJ. 

2 Form the matrix product Y = A^l. 

3 Construct a matrix Q whose columns form an orthonormal basis for the 
range of Y . 



In order to invoke the proto-algorithm with confidence, we must address several 
practical and theoretical issues: 

• What random matrix Jl should we use? How much oversampling do we need? 

• How is the basis Q determined from the sample matrix Yl 

• What are the computational costs? 

• How can we solve the fixed-precision problem (1.3) when the numerical rank 
of the matrix is not known in advance? 

• How can wc use the basis Q to compute other matrix factorizations? 

• Does the randomized method work for problems of practical interest? How 
does its speed/accuracy/robustness compare with standard techniques? 

• What error bounds can we expect? With what probability? 

The next few sections provide a summary of the answers to these questions. We 
describe several problem regimes where the proto-algorithm can be implemented effi- 
ciently, and we present a theorem that describes the performance of the most impor- 
tant instantiation. Finally, we elaborate on how these ideas can be applied to perform 
principal component analysis (PCA) of large data matrices. The rest of the paper 
contains a more exhaustive treatment — including pseudocode, numerical experiments, 
and a detailed theory. 

1.4. A comparison between randomized and traditional techniques. To 

select an appropriate computational method for finding a low-rank approximation to 
a matrix, the practitioner must take into account the properties of the matrix. Is 
it dense or sparse? Does it fit in fast memory or is it stored out of core? Does the 
singular spectrum decay quickly or slowly? The behavior of a numerical linear algebra 
algorithm may depend on all these factors [13, 61, 131]. To facilitate a comparison be- 
tween classical and randomized techniques, we summarize their relative performance 
in each of three representative environments. Section 6 contains a more in-depth 
treatment. 

We focus on the task of computing an approximate SVD of an m x n matrix A 
with numerical rank k. For randomized schemes. Stage A generally dominates the 
cost of Stage B in our matrix approximation framework (§1.2). Within Stage A, the 
computational bottleneck is usually the matrix-matrix product Afl in Step 2 of the 
proto-algorithm (§1.3.3). The power of randomized algorithms stems from the fact 
that we can reorganize this matrix multiplication for maximum efficiency in a variety 
of computational architectures. 

1.4.1. A general dense matrix that fits in fast memory. A modern de- 
terministic technique for approximate SVD is to perform a rank-revealing QR fac- 
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torization of the matrix [68] and then to manipulate the factors to obtain the final 
factorization. The cost of this approach is Oikmn) floating-point operations, or flops. 

In contrast, randomized schemes can produce an approximate SVD using only 
0(mnlog(fc) + (m + n)k'^) flops. The gain in asymptotic complexity is achieved by 
using a random matrix that has some internal structure, which allows us to evaluate 
the product A^l rapidly. For example, randomizing and subsampling the discrete 
Fourier transform works well. Sections 4.6 and 11 contain more information on this 
approach. 

1.4.2. A matrix for which matrix vector products can be evaluated 
rapidly. When A is sparse or structured, it can often be applied rapidly to a vector. 
In this case, the classical technique of choice is a Krylov subspace method, such 
as the Lanczos or Arnoldi algorithm. These techniques have an asymptotic cost 
of ©(fcTmuit + (to + n)fc^) flops, where Tnmit denotes the cost of a matrix-vector 
multiplication. Note, however, that these approaches are delicate to implement and 
their convergence can be slow when the matrix has an unfortunate singular spectrum. 

In this environment, we can apply randomized methods using a Gaussian test 
matrix O to complete the factorization at the same cost, 0(fc T^^\t + (m + n)k'^) flops. 
Random schemes have at least two distinct advantages over Krylov methods. First, 
the randomized methods produce highly accurate matrix approximations even in the 
absence of gaps/jumps in the singular spectrum (which is not the case for the classical 
Lanczos and Arnoldi methods). Second, the matrix-vector multiplies required to form 
A^l can be performed in parallel. The flexibility to restructure the computation often 
leads to dramatic accelerations in practice. 

1.4.3. A general dense matrix stored in slow memory or streamed. 

When the input matrix is too large to flt in core memory, the cost of transferring the 
matrix from slow memory typically dominates the cost of performing the arithmetic. 
The standard techniques for low-rank approximation described in §1.4.1 require 0(fc) 
passes over the matrix, which can be prohibitively expensive. 

In contrast, the proto-algorithm of §1.3.3 requires only one pass over the data to 
produce the approximate basis Q for Stage A of the approximation framework. This 
straightforward approach, unfortunately, is not accurate enough for matrices whose 
singular spectrum decays slowly, but we can address this problem using very few (say, 
2 to 4) additional passes over the data [111]. See §1.6 or §4.5 for more discussion. 

Typically, Stage B uses one additional pass over the matrix to construct the ap- 
proximate SVD. With slight modiflcations, however, the two-stage randomized scheme 
can be revised so that it only makes a single pass over the data. Refer to §5.4 for 
information. 

1.5. Performance analysis. A principal goal of this paper is to provide a de- 
tailed analysis of the performance of the proto-algorithm described in §1.3.3. This 
investigation produces precise error bounds, expressed in terms of the singular values 
of the input matrix. Furthermore, we determine how several choices of the random 
matrix impact the behavior of the algorithm. 

Let us offer a taste of these results. The following theorem describes the average- 
case performance of the proto-algorithm with a Gaussian test matrix. It is a simplified 
version of Theorem 10.6. 

Theorem 1.1. Suppose that A is a real m x n matrix. Select a target rank k 
and an oversampling parameter p > 2, where k+p< min{m, n}. Execute the proto- 
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algorithm with a standard Gaussian test matrix to obtain an m x (fc + matrix Q 
with orthonormal columns. Then 



where E denotes expectation with respect to the random test matrix and Ufc+i is the 
(k + l)th singular value of A. 

It is natural that the error bound contains one copy of ak+i owing to (1.4). At 
worst, the algorithm produces a basis whose error is larger by a small polynomial 
factor. Observe that a moderate amount of oversampling results in a substantial 
performance gain. Moreover, the error bound (1.6) in the randomized algorithm is 
slightly sharper than comparable bounds for deterministic techniques based on rank- 
revealing QR algorithms [68]. 

The reader might be worried about whether the expectation provides a useful 
account of the approximation error. Fear not: the actual outcome of the algorithm is 
almost always the typical outcome because of measure concentration effects. As we 
discuss in §10.3, the probability that the error satisfies 



is at least 1 — 6 • p~P under very mild assumptions on p. This fact justifies the use of 
an oversampling term as small as p = 5. This simplified estimate is very similar to 
the major results in [94]. 

The theory developed in this paper provides much more detailed information 
about the performance of the proto-algorithm. 

• When the singular values of A decay slightly, the error \\A — QQ*A\\ does 
not depend on the dimensions of the matrix (§§10.2-10.3). 

• We can reduce the size of the bracket in the error bound (1.6) by combining 
the proto-algorithm with a power iteration (§10.4). For an example, see §1.6 
below. 

• For the structured random matrices we mentioned in §1.4.1, related error 
bounds are in force (§11). 

• We can obtain inexpensive a posteriori error estimates to verify the quality 
of the approximation (§4.3). 

1.6. Example: Principal Component Analysis. We conclude this introduc- 
tion with a short discussion of how these ideas allow us to perform principal com- 
ponent analysis (PCA) on large data matrices, which is a compelling application of 
randomized matrix approximation [111]. 

Suppose that A is an m x n matrix holding statistical data, where each column 
of A carries the data from one experiment and each row contains measurements of a 
single variable. We form the centered matrix A by subtracting the empirical mean 
from each variable. In other terms, = dij — (l/«) that each row of A 

sums to zero. The first k principal components of the data (i.e., the directions of 
maximal variance) are the k dominant left singular vectors of the input matrix A. 

We can evidently perform PCA by computing an approximate SVD of the cen- 
tered matrix A, and the two-stage randomized method offers a natural approach. 
Unfortunately, the simplest version of this scheme is inadequate for PCA because the 



E\\A- QQ*A\\ < 1 + 



4Vfc + p 



■ V min{m, n} ak+i, 



(1.6) 



||A-gg*A||< 1 + 11 v/fc + p • v/min{m, n} at+i 



(1.7) 
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Algorithm: Randomized PCA 



Given an m x n, matrix A, the number k of principal components, and an 
exponent q, this procedure computes an approximate rank-2k factorization 
UHV* . The columns of U estimate the first 2k principal components of A. 

Stage A: 

1 Generate an n x 2k Gaussian test matrix $7. 

2 Form Y = {AA*YAi}, by multiplying alternately with A and A* 

3 Construct a matrix Q whose columns form an orthonormal basis for the 
range of Y. 

Stage B: 

1 Form B = Q*A. 

2 Compute an SVD of the small matrix: B ^ VEV* . 

3 Set [/ = QU. 



singular spectrum of the data matrix often decays quite slowly. To address this diffi- 
culty, we incorporate q steps of a power iteration where g = 1,2 is usually sufficient 
in practice. The complete scheme appears below as the Randomized PCA algorithm. 
For refinements, see the discussion in §§4~5. 

This procedure requires only 2{q+l) passes over the matrix, so it is efficient even 
for matrices stored out-of-core. The flop count is 



where Tmuit is the cost of a matrix-vector multiply with A or A*. 

We have the following theorem on the performance of this method, which is a 
consequence of Corollary 10.10. 

Theorem 1.2. Suppose that A is a real m x n matrix. Select an exponent q and 
a target number k of principal components, where 2 < k < 0.5min{m,n}. Execute 
the randomized PCA algorithm to obtain a rank-2k factorization UHV* . Then 



where E denotes expectation with respect to the random test matrix and ak+i is the 
(k + l)th singular value of A. 

This result is new. Observe that the bracket in (1.8) is essentially the same as the 
bracket in the basic error bound (1.6). We find that the power iteration drives the 
leading constant to one exponentially fast as the power q increases. Since the rank-fc 
approximation of A can never achieve an error smaller than ak+i, the randomized 
procedure estimates 2k principal components that carry essentially as much variance 
as the first k actual principal components. Truncating the approximation to the first 
k terms typically produces very accurate results. 

1.7. Outline of paper. The paper is organized into three parts: an introduction 
(§§1-3), a description of the algorithms (§§4-7), and a theoretical performance analysis 
(§§8-11). Each part commences with a short internal outline, and the three segments 
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are more or less self-contained. After a brief review of our notation in §§3.1-3.2, the 
reader can proceed to either the algorithms or the theory part. 

2. Related work and historical context. The idea of using randomness to 
develop algorithms for numerical linear algebra appears to be fairly recent, although 
we can trace the intellectual origins back to earlier work in computer science and — 
especially — to probabilistic methods in geometric analysis. This section presents an 
aerial view of this young field. We begin with a survey of randomized methods for 
matrix approximation; then we attempt to trace some of the ideas backward to their 
sources. We apologize in advance for any inaccuracies or omissions, and we welcome 
any feedback that might improve this account. 

2.1. Randomized matrix approximation. Matrices of low numerical rank 
contain little information relative to their apparent dimension owing to the linear de- 
pendency in their columns (or rows). As a result, it is reasonable to expect that these 
matrices can be approximated with far fewer degrees of freedom. A less obvious fact 
is that randomized schemes can be used to produce these approximations efficiently. 

Several types of approximation techniques build on this idea. These methods all 
follow the same basic pattern: 

1. Preprocess the matrix, usually to calculate sampling probabilities. 

2. Take random samples from the matrix, where the term sample refers generi- 
cally to a linear function of the matrix. 

3. Postprocess the samples to compute a final approximation, typically with 
classical techniques from numerical linear algebra. This step may require 
another look at the matrix. 

We continue with a description of the most common approximation schemes. 

2.1.1. Sparsification. The simplest approach to matrix approximation is the 
method of sparsification or the related technique of quantization. The goal of this 
process is to replace the matrix by a surrogate that contains far fewer nonzero entries; 
quantization further restricts these nonzero entries to a small set of values. Sparsifica- 
tion can be used to reduce storage or to accelerate computations by reducing the cost 
of matrix-vector and matrix-matrix multiplies [96, Ch. 6]. The paper [36] describes 
applications in optimization. 

Sparsification typically involves very simple elementwise calculations. Each entry 
in the approximation is drawn independently at random from a distribution deter- 
mined from the corresponding entry of the input matrix. The expected value of the 
random approximation equals the original matrix, but the distribution is designed so 
that a typical realization is much sparser. 

The first method of this form was devised by Achiloptas and McSherry [2], who 
built on earlier work on graph sparsification due to Karger [76,77]. Arora-Hazan- 
Kale presented a different sampling method in [7]. See [60, 123] for some recent work 
on sparsification. 

2.1.2. Column selection methods. A second approach to matrix approxima- 
tion is based on the idea that a small set of columns describes most of the action of 
a numerically low-rank matrix. Indeed, classical existential results [117] demonstrate 
that every m x n matrix A contains a fc-column submatrix C for which 
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where A: is a parameter, the dagger f denotes the pscudoinverse, and is a best 
rank-fc approximation of A. In general, it is NP-hard to produce a submatrix C 
that minimizes the left-hand side [30]. Nevertheless, there are efficient deterministic 
algorithms, such as the rank-revealing QR method of [68], that can nearly achieve the 
error bound (2.1). 

There is a class of randomized algorithms that approach the fixed-rank approx- 
imation problem (1.5) using this intuition. These methods first compute a sampling 
probability for each column, either using the squared Euclidean norms of the columns 
or their leverage scores. (Leverage scores reflect the relative importance of the columns 
to the action of the matrix; they can be calculated easily from the dominant k right 
singular vectors of the matrix.) Columns are then selected randomly according to this 
distribution. Afterward, a postprocessing step is invoked to produce a more refined 
approximation of the matrix. 

We believe that the earliest method of this form appeared in a 1998 paper 
of Frieze-Kannan-Vempala [57,58]. Later, this work was refined substantially by 
Drineas-Kannan-Mahoney [46] . The basic algorithm samples columns from a distri- 
bution related to the squared £2 norms of the columns. This sampling step produces 
a small column submatrix whose range is aligned with the range of the input matrix. 
The final approximation is obtained from a truncated SVD of the submatrix. Given a 
target rank k and a parameter e > 0, this approach produces a rank- A: approximation 
B that satisfies 

||A-B||p<||A-A(fe)||p-|-e||A||p, (2.2) 

where ||-||p denotes the Frobenius norm. The algorithm of [46] is computationally 
efficient and requires only a constant number of passes over the data. 

Rudelson and Vershynin later showed that the same column sampling method 
also yields spectral- norm error bounds [116]. The techniques in their paper have 
been very influential; their work has already found other applications in randomized 
regression [51], sparse approximation [132], and compressive sampling [21]. 

Deshpande et al. [39, 40] demonstrated that the error in the column sampling 
approach can be improved by iteration and adaptive sampling. They showed that it 
is possible to produce a rank-fc matrix B that satisfies 

1|A-B||p<(l + £)||A-A(,)||p (2.3) 

using a fc-pass algorithm. Around the same time, Har-Peled [69] independently de- 
veloped a recursive algorithm that offers the same approximation guarantees. 

Drineas et al. and Boutsidis et al. have also developed randomized algorithms 
for the column subset selection problem, which requests a column submatrix C that 
achieves a bound of the form (2.1). Using the methods of Rudelson and Vershynin [116], 
they showed that sampling columns according to their leverage scores is likely to 
produce the required submatrix [49,50]. Subsequent work [17, 18] showed that post- 
processing the sampled columns with a rank-revealing QR algorithm can reduce the 
number of output columns required while improving the classical existential error 
bound (2.1). The argument in [17] explicitly decouples the linear algebraic part of 
the analysis from the random matrix theory. The theoretical analysis in the present 
work relies on a very similar technique. 

2.1.3. Approximation by dimension reduction. A third approach to matrix 
approximation is based on the concept of dimension reduction. Since the rows of a 
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low-rank matrix arc linearly dependent, they can be embedded into a low-dimensional 
space without altering their geometric properties substantially. A random linear map 
provides an efficient, nonadaptive way to perform this embedding. (Column sampling 
can also be viewed as an adaptive form of dimension reduction.) 

The proto-algorithm we set forth in §1.3.3 is simply a dual description of the 
dimension reduction approach: collecting random samples from the column space of 
the matrix is equivalent to reducing the dimension of the rows. No precomputation 
is required to obtain the sampling distribution, but the sample itself takes some work 
to collect. Afterward, we orthogonalize the samples as preparation for constructing 
various matrix approximations. 

We believe that the idea of using dimension reduction for algorithmic matrix 
approximation ffist appeared in a 1998 paper of Papadimitriou et al. [104, 105], who 
described an application to latent semantic indexing (LSI). They suggested projecting 
the input matrix onto a random subspace and compressing the original matrix to (a 
subspacc of) the range of the projected matrix. They established error bounds that 
echo the result (2.2) of Frieze et al. [58]. Although the Euclidean column selection 
method is a more computationally efficient way to obtain this type of error bound, 
dimension reduction has other advantages, e.g., in terms of accuracy. 

Somewhat later, Martinsson-Rokhlin-Tygert studied dimension reduction using a 
Gaussian transform matrix, and they demonstrated that the approach performs much 
better than earlier analyses had suggested [94] . Their work highlights the importance 
of oversampling, and their error bounds are very similar to the estimate (1.7) we 
presented in the introduction. They also demonstrated that dimension reduction can 
be used to compute an interpolative decomposition of the input matrix, which is 
essentially equivalent to performing column subset selection. 

Sarlos argued in [118] that the computational costs of dimension reduction can be 
reduced substantially by using structured random maps proposed by Ailon-Chazelle [3] 
Sarlos used these ideas to develop efficient randomized algorithms for least-squares 
problems; he also studied approximate matrix multiplication and low-rank matrix 
approximation. The recent paper [103] analyzes a very similar matrix approximation 
algorithm using Rudelson and Vershynin's methods [116]. 

The initial work on structured dimension reduction did not immediately yield al- 
gorithms for low-rank matrix approximation that were superior to classical techniques. 
Woolfe et al. showed how to obtain an improvement in asymptotic computational cost, 
and they applied these techniques to problems in scientific computing [135]. Related 
work includes [88,90]. Very recently, Clarkson and Woodruff have streamlined these 
methods to obtain single-pass algorithms that are suitable for the streaming-data 
model [32]. 

Rokhlin-Szlam-Tygert have shown that combining dimension reduction with a 
power iteration is an effective way to improve its performance. These ideas lead to 
very efficient randomized methods for large-scale PC A [111]. Related ideas appear in 
work of Roweis [113]. 

2.1.4. Approximation by submatrices. The matrix approximation literature 
contains a subgenre that discusses methods for building an approximation from a 
submatrix and computed coefficient matrices. For example, we can construct an 
approximation using a subcoUection of columns (the interpolative decomposition), a 
subcollection of rows and a subcoUection of columns (the CUR decomposition), or a 
square submatrix (the matrix skeleton). This type of decomposition was developed 
and studied in several papers, including [29,64,125]. For data analysis applications. 
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see the recent paper [92]. 

A number of works develop randomized algorithms for this class of matrix ap- 
proximations. Drineas et al. have developed techniques for computing CUR decom- 
positions, which express A k, CUR, where C and R denote small column and row 
submatrices of A and where U is a small linkage matrix. These methods identify 
columns (rows) that approximate the range (corange) of the matrix; the linkage matrix 
is then computed by solving a small least-squares problem. A randomized algorithm 
for CUR approximation with controlled absolute error appears in [47] ; a relative error 
algorithm appears in [50]. We also mention a paper on computing a closely related 
factorization called the compact matrix decomposition [128]. 

It is also possible to produce interpolative decompositions and matrix skeletons 
using randomized methods, as discussed in [94, 111]. 

2.1.5. Other numerical problems. The literature contains a variety of other 
randomized algorithms for solving standard problems in and around numerical linear 
algebra. We list some of the basic references. 

Tensor skeletons. Randomized column selection methods can be used to produce 

CUR- type decompositions of higher-order tensors [48] . 
Matrix multiplication. Column selection and dimension reduction techniques can 

be used to accelerate the multiplication of rank-deficient matrices [45,118]. 

Sec also [10]. 

Overdetermined linear systems. The randomized Kaczmarz algorithm is a lin- 
early convergent iterative method that can be used to solve overdetermined 
linear systems [102,127]. 

Overdetermined least squares. Fast dimension-reduction maps can sometimes ac- 
celerate the solution of overdetermined least-squares problems [51, 118]. 

Nonnegative least squares. Fast dimension reduction maps can be used to reduce 
the size of nonnegative least-squares problems [16]. 

Preconditioned least squares. Randomized matrix approximations can be used 
to precondition conjugate gradient to solve least-squares problems [112]. 

Other regression problems. Randomized algorithms for £i regression are described 
in [31]. Regression in £p for p G [1, oo) has also been considered [34]. 

Facility location. The Fermat-Weber facility location problem can be viewed as 
matrix approximation with respect to a different discrepancy measure. Ran- 
domized algorithms for this type of problem appear in [121]. 

2.1.6. Compressive sampling. Although randomized matrix approximation 
and compressive sampling are based on some common intuitions, it is facile to con- 
sider either one as a subspecies of the other. We offer a short overview of the field 
of compressive sampling — especially the part connected with matrices — so we can 
highlight some of the differences. 

The theory of compressive sampling starts with the observation that many types 
of vector-space data are compressible. That is, the data are approximated well using 
a short linear combination of basis functions drawn from a fixed collection [44] . For 
example, natural images are well approximated in a wavelet basis; numerically low- 
rank matrices are well approximated as a sum of rank-one matrices. The idea behind 
compressive sampling is that suitably chosen random samples from this type of com- 
pressible object carry a large amount of information. Furthermore, it is possible to 
reconstruct the compressible object from a small set of these random samples, often 
by solving a convex optimization problem. The initial discovery works of Candes- 
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Rombcrg-Tao [22] and Donoho [43] were written in 2004. 

The earliest work in compressive sampling focused on vector-valued data; soon 
after, researchers began to study compressive sampling for matrices. In 2007, Rccht- 
Fazel-Parillo demonstrated that it is possible to reconstruct a rank-deficient matrix 
from Gaussian measurements [110]. More recently, Candes-Recht [24] and Candes- 
Tao [26] considered the problem of completing a low-rank matrix from a random 
sample of its entries. Matrix completion can be viewed as the problem that is dual 
to sparsification. 

The usual goals of compressive sampling are (i) to design a method for collecting 
informative, nonadaptive data about a compressible object and (ii) to reconstruct a 
compressible object given some measured data. In both cases, there is an implicit 
assumption that we have limited — if any — access to the underlying data. 

In the problem of matrix approximation, we typically have a complete representa- 
tion of the matrix at our disposal. The point is to compute a simpler representation as 
efficiently as possible under some operational constraints. In particular, we would like 
to perform as little computation as we can, but we are usually allowed to revisit the 
input matrix. Because of the different focus, randomized matrix approximation algo- 
rithms require fewer random samples from the matrix and use fewer computational 
resources than compressive sampling reconstruction algorithms. 

2.2. Origins. This section attempts to identify some of the major threads of 
research that ultimately led to the development of the randomized techniques we 
discuss in this paper. 

2.2.1. Random embeddings. The field of random embeddings is a major pre- 
cursor to randomized matrix approximation. In a celebrated 1984 paper [75], Johnson 
and Lindenstrauss showed that the pairwise distances among a collection of N points 
in a Euclidean space are approximately maintained when the points are mapped 
randomly to a Euclidean space of dimension O (log TV). In other words, random em- 
beddings preserve Euclidean geometry. Shortly afterward, Bourgain showed that ap- 
propriate random low-dimensional embeddings preserve the geometry of point sets in 
finite-dimensional £i spaces [15]. 

These observations suggest that we might be able to solve some computational 
problems of a geometric nature more efficiently by translating them into a lower- 
dimensional space and solving them there. This idea was cultivated by the theoretical 
computer science community beginning in the late 1980s, with research flowering in 
the late 1990s. In particular, nearest-neighbor search can benefit from dimension- 
reduction techniques [73,79,81]. The papers [57,104] were apparently the first to 
apply this approach to linear algebra. 

Around the same time, researchers became interested in simplifying the form 
of dimension reduction maps and improving the computational cost of applying the 
map. Several researchers developed refined results on the performance of a Gaussian 
matrix as a linear dimension reduction map [35, 73, 95]. Achlioptas demonstrated that 
discrete random matrices would serve nearly as well [1]. In 2006, Ailon and Chazelle 
proposed the fast Johnson-Lindenstrauss transform [3] , which combines the speed of 
the FFT with the favorable embedding properties of a Gaussian matrix. Subsequent 
refinements appear in [4,89]. Sarlos then imported these techniques to study several 
problems in numerical linear algebra, which has led to some of the fastest algorithms 
currently available [90,135]. 
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2.2.2. Data streams. Muthukrishnan argues that a distinguishing feature of 
modern data is the manner in which it is presented to us. The sheer volume of infor- 
mation and the speed at which it must be processed tax our abihty to transmit the 
data elsewhere, to compute complicated functions on the data, or to store a substan- 
tial part of the data [101, §3]. As a result, computer scientists have started to develop 
algorithms that can address familiar computational problems under these novel con- 
straints. The data stream phenomenon is one of the primary justifications cited by [45] 
for developing pass-efficient methods for numerical linear algebra problems, and it is 
also the focus of the recent treatment [32] . 

One of the methods for dealing with massive data sets is to maintain sketches, 
which are small summaries that allow functions of interest to be calculated. In the 
simplest case, a sketch is simply a random projection of the data, but it might be 
a more sophisticated object [101, §5.1]. The idea of sketching can be traced to the 
fundamental work of Alon et al. [5, 6]. 

2.2.3. Numerical linear algebra. Classically, the field of numerical linear al- 
gebra has focused on developing deterministic algorithms that produce highly ac- 
curate matrix approximations with provable guarantees. Nevertheless, randomized 
techniques have appeared in several environments. 

One of the original examples is the use of random models for arithmetical errors, 
which was pioneered by von Neumann and Goldstine [133,134]. These two papers 
stand among the first works to study the properties of random matrices. The earliest 
numerical linear algebra algorithm that uses randomized techniques is apparently 
Dixon's method for estimating norms and condition numbers [41]. 

Another situation where randomness commonly arises is the initialization of iter- 
ative methods for computing invariant subspaces. For example, most numerical linear 
algebra texts advocate random selection of the starting vector for the power method 
because it ensures that the vector has a nonzero component in the direction of a dom- 
inant eigenvector. Wozniakowski and coauthors have analyzed the performance of the 
power method and the Lanczos iteration given a random starting vector [80,87]. 

Among other interesting applications of randomness, we mention the work by 
Parker and Pierce, which applies a randomized FFT to eliminate pivoting in Gaus- 
sian elimination [106], work by Demmel et al. who have studied randomization in 
connection with the stability of fast methods for linear algebra [38], and work by Le 
and Parker utilizing randomized methods for stabilizing fast linear algebraic compu- 
tations based on recursive algorithms (such as Strassen's) [82]. 

2.2.4. Scientific computing. One of the first algorithmic applications of ran- 
domness is the method of Monte Carlo integration introduced by Von Neumann and 
Ulam [97], and its extensions, such as the Metropolis algorithm for simulations in sta- 
tistical physics. (See [9] for an introduction). The most basic technique is to estimate 
an integral by sampling m points from the measure and computing an empirical mean 
of the integrand evaluated at the sample locations: 



where Xi are independent and identically distributed according to the probability 
measure fi. The law of large numbers (usually) ensures that this approach produces 
the correct result in the limit as m — > cx). Unfortunately, the approximation error 
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typically has a standard deviation of to~^/^, and the method provides no certificate 
of success. 

The disappointing computational profile of Monte Carlo integration seems to 
have inspired a distaste for randomized approaches within the scientific computing 
community. Fortunately, there are many other types of randomized algorithms — such 
as the ones in this paper — that do not suffer from the same shortcomings. 

2.2.5. Geometric functional analysis. There is one more character that plays 
a central role in our story: the probabilistic method in geometric analysis. Many of 
the algorithms and proof techniques ultimately come from work in this beautiful but 
recondite corner of mathematics. 

Dvoretsky's theorem [52] states (roughly) that every infinite-dimensional Banach 
space contains an n-dimensional subspace whose geometry is essentially the same as 
an n-dimensional Hilbert space, where n is an arbitrary natural number. In 1971, 
V. D. Milman developed an audacious new proof of this result by showing that a 
random n-dimensional subspace of an iV-dimensional Banach space has this property 
with exceedingly high probability, provided that N is large enough [98]. Milman's 
article is cited as the debut of the concentration of measure phenomenon, which is 
a geometric interpretation of the classical idea that regular functions of independent 
random variables rarely deviate far from their mean. This work opened a new era in 
geometric analysis where the probabilistic method became a basic instrument. 

Another prominent example of measure concentration is Kashin's computation 
of the Gel'fand widths of the £i ball [78], subsequently refined in [59]. This work 
showed that a random (A''— n)-dimensional projection of the A^-dimensional £i ball has 
an astonishingly small Euclidean diameter: approximately ■\/(l + log{N/ n))/n. In 
contrast, a nonzero projection of the £2 ball always has Euclidean diameter one. This 
basic geometric fact undergirds recent developments in compressive sampling [23]. 

We have already described a third class of examples: the randomized embeddings 
of Johnson-Lindenstrauss [75] and of Bourgain [15]. 

Finally, we mention Maurey's technique of empirical approximation. The original 
work was unpublished; one of the earliest applications appears in [27, §1]. Although 
Maurey's idea has not received as much press as the examples above, it can lead 
to simple and efficient algorithms for sparse approximation. For some examples in 
machine learning, consider [8, 86, 109, 119] 

The importance of random constructions in the geometric analysis community 
has led to the development of powerful techniques for studying random matrices. 
Classical random matrix theory focuses on a detailed asymptotic analysis of the spec- 
tral properties of special classes of random matrices. In contrast, geometric analysts 
know methods for determining the approximate behavior of rather complicated finite- 
dimensional random matrices. See [37] for a fairly current survey article. We also 
make special mention of the works of Rudelson [114] and Rudelson-Vershynin [116], 
which describe powerful tools for studying random matrices drawn from certain dis- 
crete distributions. Their papers are rooted deeply in the field of geometric functional 
analysis, but they reach out toward computational applications. The analysis in the 
present paper leans heavily on ideas drawn from the references in this paragraph. 

3. Linear algebraic preliminaries. This section summarizes the background 
we need for the detailed description of randomized algorithms in §§4-6 and the anal- 
ysis in §§8-11. We introduce notation in §3.1, describe some standard matrix de- 
compositions in §3.2, and briefly review standard techniques for computing matrix 
factorizations in §3.3. 
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3.1. Basic definitions. The standard Hcrmitian geometry for C" is induced by 
the inner product 



{x, y) = X ■ y 
The associated norm is 

— 'J 

We usually measure the magnitude of a matrix A with the operator norm 

\\Ax\\ 



'3 ■' 



\A\\ = 



max ■ 



a!#o ||a;| 

which is often referred to as the spectral norm. The Frobenius norm is given by 



1/2 



The matrix conjugate transpose, or adjoint, of a matrix A is denoted A*. The 
important identity 

\\Af = \\A*A\\ = \\AA*\\ 

holds for each matrix A. 

We say that a matrix U is orthonormal if its columns form an orthonormal set 
with respect to the Hermitian inner product. An orthonormal matrix U preserves 
geometry in the sense that ||C/a3|| = ||a3|| for every vector x. A unitary matrix is 
a square orthonormal matrix, and an orthogonal matrix is a real unitary matrix. 
Unitary matrices satisfy the relations UU* = U* U = 1. Both the operator norm and 
the Frobenius norm are unitarily invariant, which means that 

\\UAV*\\ = \\A\\ and \\UAV*\\p = \\A\\p 

for every matrix A and all orthonormal matrices U and V with the right dimensions. 

3.2. Standard matrix factorizations. This section defines three basic matrix 
decompositions. Throughout, A is an to x n matrix with entries atj. We use the 
notation of [61] to denote submatrices. If / ~ [ii, 12, . . . , ip] and J ~ [ji, j2, • ■ • , jq] 
are two index vectors, then the associated p x q submatrix is expressed as 



For column- and row-submatrices, we use the abbreviations A(. = ^([1, 2, m],j) 
and A(^j^, ) = 2, «])• 

3.2.1. The pivoted QR factorization. Each m x n matrix A admits a de- 
composition 

A = QR, 

where Q is an m x ^ orthonormal matrix, and R is an £ x n weakly upper-triangular 
matrix. That is, there exists a permutation J of the numbers {1, 2, . . . , n} such 
that ,7) is upper triangular. Moreover, the diagonal entries of j) are weakly 
decreasing. Sec [61, §5.4.1] for details. 
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3.2.2. The singular value decomposition (SVD). Each m x n matrix A 
admits a factorization 

A = UHV*, 

where U is an m x £ orthonormal matrix, V is an n x ^ orthonormal matrix, and S 
is an ^ X ^ nonnegative, diagonal matrix 



S = 



0-2 



The numbers aj are called the singular values of A. They are arranged in weakly 
decreasing order: 

O"! > (72 > • • • > cr£ > 0. 

The columns of U and V are called left singular vectors and right singular vectors^ 
respectively. 

Singular values are connected with the approximability of matrices. For each j, 
the number ctj+i equals the spectral-norm discrepancy between A and an optimal 
rank-j approximation [99]. That is, 

(Jj+i = min{||A — B|| : B has rank j}. (3-1) 

In particular, ai = \\A\\. Sec [61, §2.5.3 and §5.4.5] for additional details. 

3.2.3. The interpolative decomposition (ID). Our final factorization iden- 
tifies a collection of k columns from a rank-/c matrix A that span the range of A. To 
be precise, there exists an index set J = [ji, . . . , jk] such that 

A = A^,j^X, 

where X is a k x n matrix that contains the k x k identity matrix 1^. Specifically, 
-X'(. J) = Ife. Furthermore, no entry of X has magnitude larger than one. 

Computing the ID of a general matrix is NP-hard. Even so, if we relax the 
restriction on X so that its entries have magnitude bounded by two, then very efficient 
algorithms are available [29,68]. 

3.3. Techniques for computing standard factorizations. This section dis- 
cusses some established deterministic techniques for computing the factorizations pre- 
sented in §3.2. The material on pivoted QR and SVD can be located in any major text 
on numerical linear algebra, such as [61, 131]. References for the ID include [29,68]. 

3.3.1. Computing the full decomposition. It is possible to compute the full 
QR factorization or the full SVD of an m x n matrix to double-precision accuracy with 
0(mnmin{TO, n}) ffops. Techniques for computing the SVD are iterative by necessity, 
but they converge so fast that we can treat them as finite for practical purposes. 
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3.3.2. Computing partial decompositions. Suppose that an m x n matrix 
has numerical rank fc, where k is substantially smaller than m and n. In this case, 
it is possible to produce a structured low-rank decomposition that approximates the 
matrix well. Sections 4 and 5 describe a set of randomized techniques for obtaining 
these partial decompositions. This section briefly reviews the classical techniques, 
which also play a role in developing randomized methods. 

To compute a partial QR decomposition, the classical device is the Businger- 
Golub algorithm, which performs successive orthogonalization with pivoting on the 
columns of the matrix. We halt the procedure when the Frobenius norm of the 
remaining columns is less than a computational tolerance e. Provided that the or- 
thogonality of the basis is maintained scrupulously, the process usually results in a 
partial factorization 

A = QR + E, (3.2) 

where Q is an m x fc orthonormal matrix, J? is a /c x n weakly upper-triangular matrix, 
and E is an error term that satisfies ||-E||p < e. The computational cost is O(fcmn). 
In practice, k is typically close to the minimal rank for which this decomposition is 
possible, but the Businger-Golub algorithm can fail for pathological matrices. 

Subsequent research has led to strong rank-revealing QR algorithms that suc- 
ceed for all matrices. For example, using O(fcmn) flops, the Gu-Eisenstat algorithm 
produces an QR decomposition of the form (3.2), where 

\\E\\ < ^l + Ak{n-k)-ak+i. 

Recall that ak+i is the minimal error possible in a rank-fc approximation [99]. The 
Gu-Eisenstat algorithm can also be used to obtain an approximate ID in 0{kmn) 
flops [29]. 

To compute an approximate SVD of a general mx n matrix, the most straightfor- 
ward technique is to compute the full SVD and truncate it. This procedure is stable 
and accurate, but it requires 0(mri min{m, ??}) flops. A more efficient approach is to 
compute a partial QR factorization and postprocess the factors to obtain a partial 
SVD using the methods described below in §3.3.3. This scheme takes only O(fcmn) 
flops. Krylov subspace methods can also compute partial SVDs at a comparable cost 
of O(fcTOn), but they are much less robust. 

Note that all the techniques described in this section require extensive random 
access to the matrix, and they can be very slow when the matrix is stored out-of-core. 

3.3.3. Converting from one partial factorization to another. Suppose 
that we have obtained a partial decomposition of a matrix A by some means: 

\\A-CB\\<e, 

where B and C have rank k. Given this information, we can efficiently compute any 
of the basic factorizations. 

We construct a partial QR factorization using the following three steps: 

1. Compute a QR factorization of C so that C — QiRi. 

2. Form the product D = RiB, and compute a QR factorization: D — Q2R- 

3. Form the product Q ~ QiQ2- 

The result is an orthonormal matrix Q and a weakly upper-triangular matrix R such 
that ||A - QR\\ < £. 
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An analogous technique yields a partial SVD: 

1. Compute a QR factorization of C so that C = QiRi. 

2. Form the product D = RiB, and compute an SVD: D = (JaSV*. 

3. Form the product U ~ QiU2- 

The result is a diagonal matrix S and orthonormal matrices U and V such that 
\\A-UY.V*\\ < e. 

Converting B and C into a partial ID is a one-step process: 

1. Compute J and X such that B = Bi^. _j)X. 

Then A « A(. ,j)X, but the approximation error may deteriorate from the initial 
estimate. For example, if we use the Gu-Eisenstat algorithm [68] to compute the 
ID, then ||A — A(. .,j)X\\ < (1 + i/l + Ak{n - k)) ■ e. Compare this bound with 
Lemma 5.1 below. 

3.3.4. Krylov-subspace methods. Suppose that the matrix A can be ap- 
plied rapidly to vectors, as happens when A is sparse or structured. Then Krylov 
subspace techniques can very effectively and accurately compute partial spectral de- 
compositions. For concreteness, assume that A is square. The concept behind these 
techniques is to fix a random starting vector a; and to generate the corresponding 
Krylov subspace 

Vg{u:) = span {a;, Auj, A^uj, . . . , A^^^u;}. 

The algorithms essentially restrict the matrix to the subspace and perform a spectral 
decomposition of the reduced matrix. 

The execution time of Krylov subspace techniques typically satisfies 

Tkryiov ~ k Tmuit + (m + n), (3.3) 

where Tmuit is the cost of a matrix-vector multiplication. In practice, the performance 
of Krylov subspace methods depends heavily on the structure of the singular spectrum 
of the matrix. 



Part II: Algorithms 

This part of the paper, §§4-7, provides detailed descriptions of randomized algo- 
rithms for constructing low-rank approximations to matrices. As discussed in §1.2, we 
split the problem into two stages. In Stage A, we construct a subspace that captures 
the action of the input matrix. In Stage B, we use this subspace to help us obtain an 
approximate factorization of the matrix. 

Section 4 develops randomized methods for completing Stage A, and §5 describes 
deterministic methods for Stage B. Section 6 compares the computational costs of the 
resulting two-stage algorithm with the classical approaches outlined in §3. Finally. §7 
illustrates the performance of the randomized schemes via numerical examples. 

4. Stage A: Randomized schemes for approximating the range. This 
section outlines techniques for constructing a subspace that captures most of the 
action of a matrix. We begin with a recapitulation of the proto-algorithm that we 
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introduced in §1.3. Wc discuss how it can be implemented in practice (§4.1) and then 
consider the question of how many random samples to acquire (§4.2). Afterward, we 
present several ways in which the basic scheme can be improved. Sections 4.3 and 4.4 
explain how to address the situation where the numerical rank of the input matrix is 
not known in advance. Section 4.5 shows how to modify the scheme to improve its 
accuracy when the singular spectrum of the input matrix decays slowly. Finally, §4.6 
describes how the scheme can be accelerated by using a structured random matrix. 

4.1. The proto-algorithm revisited. The most natural way to implement 
the proto-algorithm from §1.3 is to draw a random test matrix n from the standard 
Gaussian distribution. That is, each entry of fl is an independent Gaussian random 
variable with mean zero and variance one. For reference, we formulate the resulting 
scheme as Algorithm 4.1. 



Algorithm 4.1 

Given an m x n matrix A, and an integer i, this scheme computes an m x £ 
orthonormal matrix Q whose range approximates the range of A. 

1 Draw an n X ^ Gaussian random matrix fl. 

2 Form the m x £ matrix Y — AVI. 

3 Construct an m x ^ matrix Q whose columns form an orthonormal basis 
for the range of Y . 



The number Tbasic of flops required by Algorithm 4.1 satisfies 

-^basic ^ 

where Tj-and is the cost of generating a Gaussian random number and T-a^uit is the cost 
of multiplying A by a vector. The three terms in (4.1) correspond directly with the 
three steps of Algorithm 4.1. 

Empirically, wc have found that the performance of Algorithm 4.1 depends very 
little on the quality of the random number generator used in Step 1. 

The actual cost of Step 2 depends substantially on the matrix A and the com- 
putational environment that we are working in. The estimate (4.1) suggests that 
Algorithm 4.1 is especially efficient when the matrix-vector product x ^ Ax can be 
evaluated rapidly. In particular, the scheme is appropriate for approximating sparse 
or structured matrices. Turn to §6 for more details. 

The most important implementation issue arises when performing the basis cal- 
culation in Step 3. Typically, the columns of the sample matrix Y arc almost linearly 
dependent, so it is imperative to use stable methods for performing the orthonor- 
malization. We have found that the Gram-Schmidt procedure, augmented with the 
double orthogonalization described in [12], is both convenient and reliable. Methods 
based on Householder reflectors or Givens rotations also work very well. Note that 
very little is gained by pivoting because the columns of the random matrix Y are 
independent samples drawn from the same distribution. 

4.2. The number of samples required. The goal of Algorithm 4.1 is to pro- 
duce an orthonormal matrix Q with few columns that achieves 



||(i-gg*)A|| <£, 



(4.2) 
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where e is a specified computational tolerance. The number of columns I that the 
algorithm needs to reach this threshold is usually slightly larger than the minimal 
rank k of the smallest basis that verifies (4.2). We call the discrepancy p = £ — k the 
oversampling parameter. The size of the oversampling parameter depends on several 
factors: 

The matrix dimensions. Very large matrices may require more oversampling. 

The singular spectrum. The more rapid the decay of the singular values, the less 
oversampling is needed. In the extreme case that the matrix has exact rank 
fc, it is not necessary to oversample. 

The random test matrix. Gaussian matrices succeed with very little oversampling, 
but they can be difficult to generate precisely. The structured random ma- 
trices discussed in §4.6 may require substantial oversampling, but they still 
yield computational gains in certain settings. 

The theoretical results in Part III provide detailed information about how the 
behavior of randomized schemes depends on these factors. For the moment, we limit 
ourselves to some general remarks on implementation issues. 

For Gaussian test matrices, it is adequate to choose the oversampling parameter 
to be a small constant, such as p = 5 or p = 10. There is rarely any advantage to 
select p > k. This observation from demonstrates that a Gaussian test matrix results 
in a negligible amount of extra computation [94] . 

In practice, the target rank k is rarely known in advance. Randomized algorithms 
are usually implemented to increase the number of samples adaptively until the desired 
tolerance is met. In other words, the user never chooses the oversampling parameter. 
Theoretical results that bound the amount of oversampling are valuable primarily as 
aids for designing algorithms. We develop an adaptive approach in §§4.3-4.4. 

The computational bottleneck in Algorithm 4.1 is usually the formation of the 
product Art. As a result, it often pays to draw a larger number i of samples than 
necessary because the user can minimize the cost of the matrix multiplication with 
tools such as blocking of operations, high-level linear algebra subroutines, parallel 
processors, etc. This approach may lead to an ill-conditioned sample matrix Y , but 
the orthogonalization in Step 3 of Algorithm 4.1 can easily identify the numerical 
rank of the sample matrix and ignore the excess samples. Furthermore, Stage B of 
the matrix approximation process succeeds even when the basis matrix Q has a larger 
dimension than necessary. 

4.3. A posteriori error estimation. Algorithm 4.1 is designed for solving the 
fixed-rank problem, where the target rank of the input matrix is specified in advance. 
To handle the fixed-precision problem, where the parameter is the computational 
tolerance, we need a scheme for estimating how well a putative basis matrix Q captures 
the action of the matrix A. To do so, we develop a probabilistic error estimator. These 
methods are inspired by work of Dixon [41]; our treatment follows [90, 135]. 

The exact approximation error is ||(I — QQ*)A\\. It is intuitively plausible that 
we can obtain some information about this quantity by computing ]|(I — QQ*)Auj\\, 
where a; is a standard Gaussian vector. This notion leads to the following method. 
Draw a sequence {u;'^'^ : i = 1,2, ... ,r} of standard Gaussian vectors, where r is a 
small integer that balances computational cost and reliability. Then 

||(I-gg*)A|| < lOW- max ll(I-gg*)Aa;(*)ll (4.3) 

V TT 2— l,...,r 
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except with probability 10 This statement follows by setting B = {1 QQ*)A 
and a = 1/10 in the following lemma, whose proof appears in [135, §3.4]. 

Lemma 4.1. Let B be a real m x n matrix. Fix a positive integer r and a real 
number a € (0, 1). Draw an independent family {a;'*^ : i = 1, 2, . . . , r} of standard 
Gaussian vectors. Then 

\\B\\ < -X - max llBw^ll 

a V TT i=l,...,r II II 

except with probability . 

The critical point is that the error estimate (4.3) requires only a small number of 
matrix-vector products, hence is computationally cheap. Therefore, we can make a 
lowball guess for the numerical rank of A and add more samples if the error estimate 
is too large. The asymptotic cost of Algorithm 4.1 is preserved if we double our guess 
for the rank at each step. For example, we can start with 32 samples, compute another 
32, then another 64, etc. 

Remark 4.1. The estimate (4.3) is actually somewhat crude. We can obtain a 
better estimate at a similar computational cost by initializing a power iteration with 
a random vector and repeating the process several times [90] . 

4.4. Error estimation (almost) for free. The error estimate described in §4.3 
can be combined with any method for constructing an approximate basis for the range 
of a matrix. In this section, we explain how the error estimator can be incorporated 
into Algorithm 4.1 at almost no additional cost. 

To be precise, let us suppose that A is an m x n matrix and e is a computational 
tolerance. We seek an integer i and an m x i orthonormal matrix Q^^^ such that 

||(I-Q(^)(Q^'^)*)^|| <£• (4-4) 

The size £ of the basis will typically be slightly larger than the size k of the smallest 
basis that achieves this error. 

The basic observation behind the adaptive scheme is that we can generate the 
basis in Step 3 of Algorithm 4.1 incrementally. Starting with an empty basis matrix 
Q^^\ the following scheme generates an orthonormal matrix whose range captures 
the action of A: 

for i = 1, 2, 3, . . . 

Draw an n X 1 Gaussian random vector and set y^*) = A(jJ^^\ 

Compute = (l - Q('-i)(Q(*-i))*) yW. 

Normalize = g^^VH?^'^ || > and form Q^') = [Q('-i) qW]. 
end for 

How do we know when we have reached a basis Q^^^ that verifies (4.4)? The answer 
becomes apparent once we observe that the vectors q*^*^ are precisely the vectors that 
appear in the error bound (4.3). The resulting rule is that we break the loop once we 
observe r consecutive vectors q'*^ whose norms are smaller than e/{10^/2j7r). 

A potential complication is that the vectors q*-'-* become small as the basis starts 
to capture most of the action of A. In finite-precision arithmetic, their direction is 
extremely unreliable. To address this problem, we simply reproject the normalized 
vector q^'^ onto range((5(*~^^)^. 
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A formal description of the resulting algorithm appears as Algorithm 4.2. Its 
CPU time requirement is essentially identical with that of Algorithm 4.1. Note that 
the algorithm computes the last few samples purely to obtain the error estimate; 
this apparent extra cost is offset by the fact that Algorithm 4.1 always includes an 
oversampling factor. The failure probability stated for this scheme is pessimistic 
because it is derived from a simple argument using the union bound. In practice, the 
error estimator is reliable in a range of circumstances when we take r = 10. 





Algorithm 4.2 






Given an m X n matrix A, a tolerance e, and an 


integer r, 


the following 


sch 


erne computes an orthonormal matrix Q such that (1.3) holds with prob- 


ability at least 1 — min{TO,n}10 






1 


Draw standard Gaussian vectors . . . , a;''" 


of length 


n. 


2 


For i = 1, 2, . . . , r, compute y'*) = Atui^*^ . 






3 








4 


Q^^^ — [], the m X empty matrix. 






5 


while max { 2/'^'+^^ , 2/^^'+^^ , • • • , 2/^^'+''^ 


■ > e/{10^ 


/2A), 


6 








7 


Overwrite y(j') by (I - 


yW. 




8 








9 


QU) ^ [Q(J-I) qW]. 






10 


Draw a standard Gaussian vector 0;^^+'" 


of length 


n. 


11 


yU+r) ^ (I _ Q0)(Q0"))*) 






12 


fori = (j + l),(j + 2),...,(i + r-l). 






13 


Overwrite y''' by y'*' — q'^^'> (q^^\ 


yW). 




14 


end for 






15 


end while 






16 









Remark 4.2. The calculations in Algorithm 4.2 can be organized so that each 
iteration processes a block of samples simultaneously. This revision can lead to dra- 
matic improvements in speed because it allows us to exploit BLAS3 (i.e., higher- level 
linear algebra subroutines) or parallel processors. Although blocking can lead to 
the generation of unnecessary samples, this outcome is generally harmless, as noted 
in §4.2. 

4.5. A modified scheme for matrices whose singular values decay slowly. 

The techniques described in §4.1 and §4.4 work well for matrices whose singular values 
exhibit some decay, but they may result in a poor basis when the input matrix has a 
flat singular spectrum or when the input matrix is very large. Our theoretical analysis 
in §10 quantifies these tradeoffs precisely. 

In this section, we develop a remedy suggested in [67, 111] that is related to earlier 
work of [113]. This approach incorporates Krylov-subspace ideas to produce a basis 
with near-optimal accuracy. The intuition is that the small singular vectors interfere 
with the calculation, so we reduce their weight relative to the dominant singular 
vectors by means of a power iteration. 

More precisely, we can apply the randomized sampling scheme to the matrix 
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B ~ {AA*)'^A, where g is a small integer. The matrix B has the same singular 
vectors as the input matrix A, but its singular values decay much more quickly: 

a,{B) ^ a,{Af^+\ j = 1,2,3,.... 

This approach requires 2q+l times as many matrix- vector multiplies as Algorithm 4.1, 
but it is far more accurate. A good heuristic is that, when the original scheme produces 
a basis whose approximation error is within a factor C of the optimum, the power 
scheme produces an approximation error within C'^^'-^''^^^ of the optimum. In other 
words, the power iteration drives the approximation gap to one exponentially fast. 
See Theorem 9.2 and §10.4 for the details. 

In practice, we must adapt this idea to account for the fact that calculations are 
performed in finite-precision arithmetic. In this case, it is advantageous to use the 
sample matrix formed when we retain the results of intermediate calculations: 

An I {AA*)An I ... I {AA*yAn ]. (4.5) 

The resulting scheme is summarized as Algorithm 4.3. Corollary 9.3 and [111] provide 
additional analysis. 



Algorithm 4.3 

Given an mxn matrix A and numbers £, q, this algorithm computes an mx£ 
orthonormal matrix Q whose range approximates the range of A. 

1 Draw ail n X £ standard Gaussian matrix fJ. 

2 Compute the n x {q + 1)£ sample matrix W = [y'-°\ Y'-^\ . . . , Y'-'^'^] , 
where the matrices l^*^*-* are defined by induction: l^C') = Afl, and Y^'^ = 

fori= 1, 2, g. 

3 Form the n x £ matrix Q by taking the first £ steps of a rank-revealing 
QR factorization of Z. 



Algorithm 4.3 targets the fixed-rank problem. To address the fixed-precision prob- 
lem, we can incorporate the error estimators described in §4.3 to obtain an adaptive 
scheme analogous with Algorithm 4.2. In situations where it is critical to achieve near- 
optimal approximation errors, one can increase the oversampling beyond our standard 
recommendation £ = A: -I- 5 all the way to ^ = 2fc without changing the scaling of the 
asymptotic computational cost. A supporting analysis appears in Corollary 10.10. 

4.6. An accelerated technique for general dense matrices. This section 
describes a set of techniques that allow us to compute an approximate rank-^ factor- 
ization of a general dense mxn matrix in roughly 0(TOnlog(£)) flops, in contrast to 
the asymptotic cost 0{mn£) required by earlier methods. We can tailor this scheme 
for the real or complex case, but we focus on the conceptually simpler complex case. 
These algorithms were introduced in [135]; similar techniques were proposed in [118]. 

The first step toward this accelerated technique is to observe that the bottleneck 
in Algorithm 4.1 is the computation of the matrix product AO,. When the test matrix 
n is standard Gaussian, the cost of this multiplication is 0{mn£), the same as a rank- 
revealing QR algorithm [68] . The key idea is to use a structured random matrix that 
allows us to compute the product in 0(mnlog(£)) fiops. 
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The subsampled random Fourier transform, or SRFT, is perhaps the simplest 
example of a structured random matrix that meets our goals. An SRFT is an n x £ 
matrix of the form 



• £> is an ri X n diagonal matrix whose entries are independent random variables 
uniformly distributed on the complex unit circle, 

• is the n X n unitary discrete Fourier transform, whose entries take the 
values fpg = e-27ri(p-i)(9-i)/« for p, g = 1, 2, . . . , n, and 

• i? is an n X ^ matrix that samples £ coordinates from n uniformly at random, 
i.e., its £ columns are drawn randomly without replacement from the columns 
of the n X n identity matrix. 

When rj is defined by (4.6), we can compute the sample matrix Y = Afl using 
0(mnlog(^)) flops via a subsampled FFT [135]. Then we form the basis Q by or- 
thonormalizing the columns of Y, as documented in §4.1. The total number Tgtruct 
of flops required by this procedure is 



This scheme is analogous with Algorithm 4.1. Note that if £ is substantially larger than 
the numerical rank k of the input matrix, we can perform the orthogonalization with 
0{k£n) flops because the columns of the sample matrix are almost linearly dependent. 

The test matrix (4.6) is just one choice among many possibilities. Other sugges- 
tions that appear in the literature include subsampled Hadamard transforms, chains 
of Givens rotations acting on randomly chosen coordinates, and many more. See [88] 
and its bibliography. Empirically, we have found that the transform summarized in 
Remark 4.5 below performs very well in a variety of environments [112]. 

At this point, it is not well understood how to quantify and compare the be- 
havior of structured random transforms. One reason for this uncertainty is that it 
has been difficult to analyze the amount of oversampling that various transforms re- 
quire. Section 11 establishes that the random matrix (4.6) can be used to identify 
a near-optimal basis for a rank-A: matrix using £ {k + log(n)) log(fc) samples. In 
practice, the transforms (4.6) and (4.8) typically require no more over-sampling than 
do a Gaussian random matrix. (For a numerical example, see §7.5.) In consequence, 
setting ^ = /s + 10 or £ = fc + 20 is typically more than adequate. Further research on 
these questions would be valuable. 

Remark 4.3. The structured random matrices discussed in this section do not 
adapt readily to the fixed-precision problem, where the computational tolerance is 
specified, because the samples from the range are usually computed in bulk. Fortu- 
nately, these schemes are sufficiently inexpensive that we can progressively increase 
the number of samples computed starting with ^ = 32, say, and then proceeding to 
£ = 64, 128, 256, . . . until we achieve the desired tolerance. 

Remark 4.4. When using the SRFT (4.6) for matrix approximation, we have 
a choice whether to use a subsampled FFT or a full FFT. The complete FFT is so 
inexpensive that it often pays to construct an extended sample matrix Yjargo ~ ADF 




(4.6) 



where 



Tstruct ~ mn log(^) -I- £^n 



(4.7) 
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and then generate the actual samples by drawing columns at random from YJargo a-nd 
rescaling as needed. The asymptotic cost increases very slightly to 0{mn\og{n)) flops, 
but the full FFT is actually faster for moderate problem sizes because the constant 
suppressed by the big-0 notation is so small. Adaptive rank determination is easy 
because we just examine extra samples as needed. 

Remark 4.5. Among the structured random matrices that we have tried, one of 
the strongest candidates involves sequences of random Givens rotations [112]. This 
matrix takes the form 

n = D" &' D' @DF R, (4.8) 

where the prime symbol ' indicates an independent realization of a random matrix. 
The matrices R, F, and D are defined after (4.6). The matrix © is a chain of random 
Givens rotations: 

= n G(l, 2; Oi) G(2, 3; 62) ■ ■ ■ Gin - 1, n; 9n-i) 

where II is a random n x n permutation matrix; where 0i, . . . , 0„_i are independent 
random variables uniformly distributed on the interval [0, 27r]; and where G{i,j;9) 
denotes a rotation on R" by the angle 9 in the (i, j) coordinate plane [61, §5.1.8]. 

5. Stage B: Construction of standard factorizations. The algorithms for 
Stage A described in §4 produce an orthonormal matrix Q whose range captures the 
action of an input matrix A: 

\\A-QQ*A\\<s, (5.1) 

where e is a computational tolerance. This section describes methods for approximat- 
ing standard factorizations of A using the information in the basis Q. 

To accomplish this task, we pursue the idea from §3.3.3 that any low-rank factor- 
ization A w CB can be manipulated to produce a standard decomposition. When 
the bound (5.1) holds, the low-rank factors are simply C = Q and B = Q*A. The 
simplest scheme (§5.1) computes the factor B directly with a matrix product to ensure 
a minimal error in the final approximation. An alternative approach (§5.2) constructs 
B using only the information in the basis Q. The second method can be faster but 
typically results in larger errors. Both schemes can be streamlined for an Hcrmitian 
input matrix (§5.3). Finally, we develop single-pass algorithms that exploit other 
information generated in Stage A to avoid revisiting the input matrix (§5.4). 

Throughout this section, A denotes an m x n matrix, and Q is an to x A: or- 
thonormal matrix that verifies (5.1). For purposes of exposition, we concentrate on 
methods for constructing the partial SVD. 

5.1. Factorizations based on forming Q*A directly. The relation (5.1) im- 
plies that II A — QB\\ < e, where B = Q*A. Once we have computed B, we can 
produce any standard factorization using the methods of §3.3.3. Algorithm 5.1 illus- 
trates how to build an approximate SVD. 

The factors produced by Algorithm 5.1 satisfy 

\\A - UY,V*\\ < e. (5.2) 
Therefore, the approximation error docs not degrade. 
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Algorithm 5.1 

Given matrices A and Q such that (5.1) holds, this procedure computes an 
approximate SVD A w UT,V* . 

1 Form the matrix B ^ Q* A. 

2 Compute an SVD of the smaU matrix: B — UHV* . 

3 Form the product U = QU. 



The cost of Algorithm 5.1 is generaUy dominated by the cost of the product Q*A 
in Step 1, which takes 0{kmn) flops for a general dense matrix. Note that this scheme 
is particularly well suited to environments where we have a fast method for computing 
the matrix-vector product x i-^ A*x, for example when A is sparse or structured. 
This approach retains a strong advantage over Krylov-subspace methods and rank- 
revealing QR because Step 1 can be accelerated using BLAS3, parallel processors, and 
so forth. Steps 2 and 3 require 0{k'^n) and O(fc^m) flops respectively. 

5.2. Postprocessing via row extraction. Given a matrix Q such that (5.1) 
holds, we can obtain a rank-fc factorization 

A^XB, (5.3) 

where B is a fc x n matrix consisting of k rows extracted from A. The approxima- 
tion (5.3) can be produced without computing any matrix-matrix products, which 
makes this approach to postprocessing very fast. The drawback comes because the er- 
ror \\A — XB\\ is usually larger than the initial error ||A — QQ*A\\, especially when 
the dimensions of A are large. See Remark 5.2 for more discussion. 

To obtain the factorization (5.3), we simply construct the intcrpolative decompo- 
sition (§3.2.3) of the matrix Q: 

Q = XQ(j,y (5.4) 

The index vector J marks k rows of Q that span the row space of Q, and X is an 
m X k matrix whose entries are bounded in magnitude by 2 and contains the k x k 
identity as a submatrix: X(^j . ) = Ifc. Combining (5.4) and (5.1), we reach 

Af=i QQ*A = XQ^ j r,Q*A. (5.5) 

Since -^(j, : ) — Ife, equation (5.5) implies that ) w Q{J,-- )Q*A Therefore, (5.3) 

follows when we put B = A(^j . y 

Provided with the factorization (5.3), we can obtain any standard factorization 
using the techniques of §3.3.3. Algorithm 5.2 illustrates an SVD calculation. This 
procedure requires 0{k'^{m + n)) flops. The following lemma guarantees the accuracy 
of the computed factors. 

Lemma 5.1. Let A be an m x n matrix, and let Q be an m x k matrix that satisfy 
(5.1). Suppose that U , S, and V are the matrices constructed by Algorithm 5.2. Then 



A-UT,V*\\< 1 + ^l + Ak{n-k) e. (5.6) 
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Algorithm 5.2 

Given matrices A and Q such that (5.1) holds, this algorithm computes an 
approximate SVD A « VEV* . 

1 Compute an ID Q = XQ(^j_. y 

2 Extract A(j . ), and compute a QR factorization A{^j . ) ~ R*V* . 

3 Form the product Z = XR\ 

4 Compute an SVD Z = C/SV*. 

5 Form the product V = VV . 



Proof. The factors U , S, V constructed by the algorithm satisfy 

unv* = ui:v*v* = zv* = xR*v* = ). 

Define 

A = QQ*A. (5.7) 

Since A = XQ(^j . )Q* A and since X(^j_. ) = Ij., it must be that A(j . ) = Q(,7_: )Q* A. 
Consequently, 

A = XA^j,,y 

We have the chain of relations 

\\A-Ui:V*\\ = ||A-XA(7,,)|| 

= ||(A-Xl(,,,.)) + (xi(j,.)-XA(j,.))|| 

< \\A- A\\ + \\XA(j^.,)- XA(j^.,^)\\ 

< llA-All (5.8) 

Inequality (5.1) ensures that \\A — A\\ < e. Since j — A(^j . j is a submatrix of 
A — A, wc must also have ||^(,/. : ) — ^(./, : )|| ^ Thus, (5.8) reduces to 

\\A-U'SV*\\<{l + \\X\\)e. (5.9) 

The bound (5.6) follows from (5.9) after we observe that X contains a. k x k identity 
matrix and that the entries of the remaining [n — k) x k submatrix are bounded in 
magnitude by two. □ 

Remark 5.1. To maintain a unified presentation, we have formulated all the 
postprocessing techniques so they take an orthonormal matrix Q as input. Recall 
that, in Stage A of our framework, we construct the matrix Q by orthonormalizing 
the columns of the sample matrix Y . With finite-precision arithmetic, it is preferable 
to adapt Algorithm 5.2 to start directly from the sample matrix Y. To be precise, 
we modify Step 1 to compute X and J so that Y = XY(^j.y This revision is 
recommended even when Q is available from the adaptive rank determination of 
Algorithm 4.2. 

Remark 5.2. As the inequality (5.6) suggests, the factorization produced by 
Algorithm 5.2 is potentially less accurate than the basis that it uses as input. This 
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loss of accuracy is problematic when e is not so small or when kn is large. In such 
cases, it is important to use Algorithm 5.1, which is more costly but docs not amplify 
the error, as shown in (5.2). 

5.3. Postprocessing of an Hermitian matrix. When A is Hermitian, the 
postprocessing becomes particularly elegant. In this case, the columns of Q form 
a good basis for both the column space and the row space of A so that we have 
A w QQ* AQQ* . More precisely, when (5.1) is in force, we have 

\\A QQ*AQQ*\\ =\\A- QQ* A + QQ* A QQ*AQQ*\\ 

<\\A- QQ*A\\ + \\QQ*{A - AQQ*) \\ < 2e. (5.10) 

The last inequality relies on the facts that ||QQ*|| = 1 and that 

\\A-AQQ*\\ = \\{A - AQQ*)*\\ = \\A-QQ*A\\ . 

For Hermitian A, it is more common to compute an eigenvalue decomposition 
than an SVD. We can accomplish this goal by adapting the scheme in §5.1. Form 
B = Q*AQ, and compute its eigendecomposition B = UAU* . Finally, construct the 
product U = QU to obtain an approximate eigenvalue decomposition that satisfies 
II A - L''AL''*|| < 2e. This scheme requires O(fcn^) flops. 

We can also pursue the approach from §5.2, which is faster but less accurate. 
First, compute an ID of Q to obtain a factorization of the form (5.3). This step 
results in the symmetric decomposition A « which can be manipulated 

to obtain an approximate eigenvalue decomposition. The total cost is 0{k^n) flops. 

5.4. Single-pass algorithms. The techniques described in §§5.1-5.3 all require 
us to revisit the input matrix. This may not be feasible in environments where the 
matrix is too large to be stored. In this section, we develop a method that requires 
just one pass over the matrix to construct not only an approximate basis but also a 
complete factorization. Similar techniques appear in [135] and [32]. 

For motivation, we begin with the case where A is Hermitian. Let us recall the 
proto-algorithm from §1.3.3: Draw a random test matrix il; form the sample matrix 
Y = Afl; then construct a basis Q for the range ofY. It turns out that the matrices 
O, Y , and Q contain all the information we need to approximate A. 

To see why, imagine that we could form the matrix B ~ Q*AQ. Postmulti- 
plying this equation by Q*fl, we obtain the identity BQ*n ~ Q*AQQ*fl. The 
relationships AQQ* « A and Ail, = Y show that B must satisfy 

BQ*nKQ*Y. (5.11) 

All three matrices ft, Y, and Q are available, so we can solve (5.11) to obtain the 
matrix B. Then the low-rank factorization A w QBQ* can be converted to an 
eigenvalue decomposition via familiar techniques. Algorithm 5.3 summarizes this 
scheme. 

The cost of this algorithm is 0{k^n) flops. The following theorem demonstrates 
that the accuracy of Algorithm 5.3 hinges on the conditioning of the matrix Q*fl. 

Theorem 5.2. Let A be an nxn Hermitian matrix, let ft be an nx £ matrix, and 
let Q be an orthonormal matrix that verifies (5.1). In Step 1 of Algorithm 5.3, suppose 
that we eompute an approximate solution Sappiox that minimizes the operator-norm 
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Algorithm 5.3 

Given an Hermitian matrix A and an orthonormal matrix Q that verify (5.1), 
a random test matrix H, and a sample matrix Y = AJl, this algorithm 
computes an approximate eigenvalue decomposition A w UAU* . 

1 Use a standard least-squares solver to find an Hermitian matrix Bapprox 
that approximately satisfies the equation Bapprox(Q*J^) ~ Q*Y ■ 

2 Compute the eigenvalue decomposition Bapprox = UAU*. 

3 Form the product U = QU. 



residual. If UAU* is the eigenvalue decomposition of A produced by the algorithm, 
then 

\\A-UAU*\\<2(l+^^)e, (5.12) 

where CTmin denotes the smallest singular value ofQ*n. 

Proof Define B = Q*AQ. Relation (5.10) implies that \\A-QBQ*\\ < 2e. 
Since UAU* = QBapproxQ*, we find that 

\\A-UAU*\\ = \\A-QB„^Q*\\ 

= \\A- QBQ* + Q{B - B„^)Q*\\ 

< \\A-QBQ*\\ + \\B-B,pp,o.\\ 

<2e + \\B-B„x\\. (5.13) 

It remains to bound the second term on the right-hand side of (5.13). 

Let F denote the residual error in solving the relation in Step 1 of Algorithm 5.3 
so that 

Q*Y ^ BappioxQ*^^ + F 

We have the chain of relations 

ll-B — -Bapproxll < II (-B — -Bapprox)Q*J^| 

f^min 

= ^ \\BQ*^l - Q*Y + F\\ 

= \\Q*AQQ*n - Q*An + F\\ 

< ^(||Q*||||AQQ*-A||p|| + ||F||) 

< ^— (£||n|| + ||F||). (5.14) 

In order to bound ||-F||, recall that Bapprox is a minimum-residual solution to the 
relation -Bapprox(Q*i^) ~ Q*Y. Therefore, 

||F|| = ||Q*r-BapproxQ*rj|| 

< \\Q*Y BQ*n\\ = \\Q*Ail Q*AQQ*n\\ 

= \\Q*A{I - QQ*)n\\ < \\A{I - QQ*)\\ \M < e ||fi|| . (5.15) 
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Together, (5.13), (5.14), and (5.15) imply the result (5.12). □ 

We have checked enipirically that Algorithm 5.3 is effective when the test matrix 
is Gaussian and the number of samples £ is much smaller than the dimension n 
of the input matrix A. In this situation, |ir2|| -y/n, and experiments indicate that 
Cmin ~ v^i so the error bound (5.12) is quite strong. We have not established a 
rigorous result on the size of this minimum singular value, but this type of estimate 
is not essential in practice because wc can inexpensively compute (Tmin and j|r2j| to 
evaluate the bound on the approximation error. 

If one encounters a situation where (Tmin is small, then Algorithm 5.3 may fail. 
In this case, it may be necessary to use one of the schemes from §5.3 to produce the 
low-rank factorization or to start the approximation process again from scratch. In a 
streaming environment, unfortunately, the game is over. 

When A is not Hcrmitian, it is still possible to devise single-pass algorithms, but 
wc must modify the initial Stage A of the approximation framework to simultaneously 
construct bases for the ranges of A and A*: 

1. Generate random matrices n and CI. 

2. Compute Y = ACl and Y = A*n in a single pass over A. 

3. Compute QR factorizations Y — QR and Y — QR. 

This procedure results in matrices Q and Q such that A w QQ* AQQ* . The reduced 
matrix we must approximate is B ~ Q*AQ. In analogy with (5.11), we find that 

Q*Y = Q*An w Q*AQQ*n = BQ*ft. (5.16) 

An analogous calculation shows that B should also satisfy 

Q*Y K B*Q*n. (5.17) 

Now, the reduced matrix JBapprox can be determined by finding a minimum-residual 
solution to the system of relations (5.16) and (5.17). 

6. Computational costs. So far, we have postponed a detailed discussion of 
the computational cost of randomized matrix approximation algorithms because it is 
necessary to account for both the first stage, where wc compute an approximate basis 
for the range (§4), and the second stage, where we postprocess the basis to complete 
the factorization (§5). We are now prepared to compare the cost of the two-stage 
scheme with the cost of traditional techniques. 

Choosing an appropriate algorithm, whether classical or randomized, requires us 
to consider the properties of the input matrix. To provide a nuanced picture, we 
consider three different computational environments in §6.1-6.3. We close with some 
comments on parallel implementations in §6.4. 

For concrctcness, we focus on the problem of computing an approximate SVD of 
an m X n matrix A with numerical rank k. The costs for other factorizations are 
similar. 

6.1. General matrices that fit in core memory. Suppose that A is a general 
matrix presented as an array of numbers that fits in core memory. In this case, the 
appropriate method for Stage A is to use a structured random matrix (§4.6), which 
allows us to find a basis that captures the action of the matrix using 0(mrilog(fc) -I- 
fc^m) flops. For Stage B. we apply the row-extraction technique (§5.2), which costs 
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an additional 0(fc^(m + n)) flops. The total number of operations Tj-andoni for this 
approach satisfies 

^random ~ mnlog(fc) + fc^(m + n). 

A heuristic error bound for this algorithm is 

\\A-U^V*\\<n-<Jk+i, (6.1) 

where Cfc+i is the (fc + l)th singular value of A. The estimate (6.1), which follows 
from Theorem 11.2 and Lemma 5.1, reflects the worst-case scenario; actual errors are 
usually smaller. 

This algorithm should be compared with modern deterministic techniques, such 
as rank-revealing QR followed by postprocessing (§3.3.2) which requires 

Trrqr ^ kmn 

operations to achieve a comparable error. 

In this setting, the randomized algorithm is faster than classical techniques for 
problems of moderate size, say m, n ~ 10'^ and k ~ 10^. See §7.5 for numerical 
evidence. 

Remark 6.1. In case row extraction is impractical, there is an alternative 
0(mri log(fc)) technique described in [135, §5.2]. When the error (6.1) is unacceptably 
large, we can use the direct method (§5.1) for Stage B, which brings the total cost to 
O(fcmn) flops. 

6.2. Matrices for which matrix vector products can be rapidly evalu- 
ated. In many problems in data mining and scientific computing, the cost Tmuit of 
performing the matrix-vector multiplication x l—^ Ax is substantially smaller than 
the nominal cost 0(mn) for the dense case. It is not uncommon that 0(m -I- n) fiops 
suffice. Standard examples include (i) very sparse matrices; (ii) structured matrices, 
such as Toplitz operators, that can be applied using the FFT or other means; and 
(iii) matrices that arise from physical problems, such as discretized integral operators, 
that can be apphed via, e.g., the fast multipole method [66]. 

Suppose that both A and A* admit fast multiplies. The appropriate randomized 
approach for this scenario completes Stage A using Algorithm 4.1 with p constant 
(for the fixed-rank problem) or Algorithm 4.2 (for the fixed-precision problem) at a 
cost of 0(fcT„iuit + k^m) flops. For Stage B, we invoke Algorithm 5.1, which requires 
0(fc TJiiuit + k'^{m + n)) flops. The total cost Tgparso satisfies 

JlparsG fcT'mult + fc^(m + n). (6.2) 

A heuristic error bound is 

\\A-UT.V*\\ < Vfc^-dfe+i. 

This estimate follows from Corollary 10.9 and the discussion in §5.1. 

When the singular spectrum of A decays slowly, we can incorporate q iterations 
of the power method (Algorithm 4.3) to obtain superior solutions to the fixed-rank 
problem. The computational profile is similar to (6.2), but the heuristic error bound 
improves to 

\\A - UY,V*\\ < (fcn)i/2(29+i) . 
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This estimate takes into account the discussion in §10.4. The power scheme can also 
be adapted for the fixed-precision problem (§4.5). 

In this setting, the classical technique for obtaining a partial SVD employs a 
Krylov-subspace method (§3.3.4), whose computational cost matches (6.2), assuming 
proper convergence. Unfortunately, the performance of these methods can depend 
substantially on whether the singular values of the matrix are tightly clustered (bad), 
spread out (good), or repeated (better). 

Sampling algorithms have two advantages over traditional techniques. First, the 
performance of the randomized scheme does not depend on the fine structure of the 
singular spectrum. More important, we have tremendous freedom to organize the 
computation. For the fixed-rank problem, we can compute all 0(fc) matrix-vector 
products simultaneously; for the fixed-precision problem, we can achieve substantial 
savings by blocking the computation. Similar economies obtain in the postprocessing 
of Stage B. In contrast, Krylov methods necessarily perform all the matrix- vector 
multiplies in serial. 

Remark 6.2. The power scheme. Algorithm 4.3, is conceptually related to 
Krylov-subspace methods initialized with a random starting vector. To explain this 
point succinctly, we assume that A is square. Krylov techniques implicitly compress 
the matrix to the subspace 

Vq{u}) = span{w, Au}, A'^uj, . . . , A'^~^u:} 

and perform spectral computations on the reduced matrix. In comparison, the power 
scheme explicitly compresses the matrix to a subspace formed using many random 
starting vectors 

W,(a;) - span {Vg{u:^% V^iu:^''^), . . . , V,(a;W)} 

and performs spectral computations on the reduced matrix. In the power scheme, the 
exponent q is usually much smaller than for traditional Krylov methods, while the 
number i of random vectors is substantial. 

We view the power scheme as a hybrid that enjoys the best features of both 
randomized algorithms and Krylov-subspace methods. Exploring this connection may 
lead to further advances. See [80, 87] for an analysis of Krylov-subspace methods with 
a random start and [111, 113] for work on the randomized power scheme. 

6.3. General matrices stored in slow memory or streamed. The tradi- 
tional metric for numerical algorithms is the number of floating-point operations they 
require. When the data does not fit in fast memory, however, the computational time 
is often dominated by the cost of memory access. In this setting, a more appropri- 
ate measure of algorithmic performance is pass- efficiency, which counts how many 
times the data needs to be cycled through fast memory. Flop counts become largely 
irrelevant. 

All the classical matrix factorization techniques that we discuss in §3.2 — including 
dense SVD, rank-revealing QR, Krylov methods, and so forth — require at least k 
passes over the the matrix, which is prohibitively expensive for huge data matrices. A 
desire to reduce the pass count of matrix approximation algorithms served as one of 
the early motivations for developing randomized schemes [46, 58, 105]. Detailed recent 
work appears in [32]. 

For many matrices, randomized techniques can produce an accurate approxima- 
tion using just one pass over the data. For Hermitian matrices, we obtain a single-pass 
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algorithm by combining Algorithm 4.1 to construct an approximate basis with Algo- 
rithm 5.3, which produces an eigenvalue decomposition without any additional access 
to the matrix. Section 5.4 describes the analogous technique for general matrices. 

For the huge matrices that arise in information science problems, it is common 
that the singular spectrum decays slowly. Relevant applications include image pro- 
cessing (see §§7.3-7.4 for numerical examples), statistical data analysis, and network 
monitoring. To compute approximate factorizations in these environments, it is cru- 
cial to enhance the accuracy of the randomized approach using the power scheme, 
Algorithm 4.3, or some other device. This approach increases the pass count some- 
what, but four to eight passes are usually sufficient. 

6.4. Gains from parallelization. As mentioned in §§6.2-6.3, randomized meth- 
ods often outperform classical techniques not because they involve fewer floating-point 
operations but rather because they allow us to reorganize the calculations to exploit 
the matrix properties and the computer architecture more fully. In addition, these 
methods are well suited for parallel implementation. For example, in Algorithm 4.1, 
the computational bottleneck is the evaluation of the matrix product Aft, which is 
embarrassingly parallelizable. 

7. Numerical examples. At this point, a question presents itself. Do these 
randomized matrix approximation algorithms actually work in practice? In this sec- 
tion, we attempt to address this concern by illustrating how the algorithms perform 
on a diverse collection of test cases. 

Let us begin with some problems from the physical sciences. First, we consider two 
matrices with exponentially decaying spectra that arise from discretizations of integral 
operators (§7.1). Then we attempt to approximate the inverse of a finite-difference 
operator, where our only access to the matrix is a "matrix-vector multiply" that is 
accomplished by solving a linear system (§7.2). 

Afterward, we turn to some examples from the information sciences. Section 7.3 
investigates how well we can approximate a large graph Laplacian that describes 
the local geometry of an image. We also evaluate the pass-efficient algorithms on 
an enormous database of face images that requires nearly 5.6 GB of storage in its 
uncompressed form (§7.4). 

Finally, we investigate the performance of randomized methods based on struc- 
tured random matrices (§7.5). 

Sections 7.1-7.4 focus on the algorithms for Stage A that we presented in §4 
because we wish to isolate the performance of the randomized step. The eigenface 
problem, however, is so massive that the methods in §5 for Stage B provide the only 
practical way to assess the performance! 

7.1. Matrices associated v^rith compact integral operators. We study the 
behavior of the adaptive range approximation method. Algorithm 4.2, by applying it 
to two matrices we obtained by discretizing compact integral operators that arise in 
potential theory. The singular spectrum of each operator decays exponentially fast, 
which is a very favorable environment for random sampling schemes. 

We first consider a 200 x 200 matrix A resulting from the discretization of the 
following single-layer operator associated with the Laplace equation: 



[S'cr](x) = const • / log\x~y\ a{y)dA{y), x G Fa, (7.1) 
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where Fi and r2 are the two contours in illustrated in Figure 7.1(a). We ap- 
proximate the integral with the trapezoidal rule, which converges superalgebraically 
because the kernel is smooth. In the absence of floating-point errors, we estimate that 
the discretization error would be less than 10^^" for a smooth source a. The leading 
constant is selected so the matrix A has unit operator norm. 

We implement Algorithm 4.2 in Matlab v6.5. Gaussian test matrices are generated 
using the randn command. For each number i of samples, we compare the following 
three quantities: 

1. The minimum rank-^ approximation error (Jij^\ is determined using svd. 

2. The actual error = || (I — A|| is computed with norm. 

3. A random estimator for the actual error eg is obtained from (4.3), with the 
parameter r set to 5. 

Note that any values less than 10^^^ should be considered numerical artifacts. 

Figure 7.2 tracks a characteristic execution of Algorithm 4.2. We make three 
observations: (i) The error eg incurred by the algorithm is remarkably close to the 
theoretical minimum ag+\. (ii) The error estimate always produces an upper bound 
for the actual error. Without the built-in lOx safety margin, the estimate would 
track the actual error almost exactly, (iii) The basis constructed by the algorithm 
essentially reaches full double-precision accuracy. 

How typical is the trial documented in Figure 7.2? To answer this question, we 
examine the empirical performance of the algorithm over 2,000 independent trials. 
Figure 7.3 charts the error estimate versus the actual error at four points during the 
course of execution: € = 25, 50, 75, 100. We offer four observations: (i) The initial 
run detailed in Figure 7.2 is entirely typical, (ii) Both the actual and estimated error 
concentrate about their mean value, (iii) The actual error drifts slowly away from the 
optimal error as the number i of samples increases, (iv) The error estimator is always 
pessimistic by a factor of about ten, which means that the algorithm never produces 
a basis with lower accuracy than requested. The only effect of selecting an unlucky 
sample matrix n is that the algorithm proceeds for a few additional steps. 

We repeat the same experiments for a 400 x 400 complex matrix B obtained when 
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we discretizc an integral operator inspired by acoustic scattering theory: 

[Sa]{x) = const ■ f H^^\k\x - y\) <j{y) dA{y), XGT2, (7.2) 

where TJq^' is the zeroth-order Hankel function of the first kind. The contours Fi and 
r2 are depicted in Figure 7.1(a). As the diagram shows, the diameter L of the inner 
contour F2 shghtly exceeds two, and we set the Helmholtz parameter k = 30. The 
leading constant is chosen so that the resulting matrix B has unit operator norm. 
The physics of the problem ensure that the operator S has O(fcL) singular values of 
order one, after which the spectrum decays rapidly. 

Figure 7.4 documents a typical test run, and Figure 7.5 compiles statistics for 
2,000 trials. Qualitatively, the algorithm behaves the same as for the Laplace problem. 

7.2. Approximating the inverse of a finite-difference operator. Next, we 
consider a situation where the input matrix A is (a restriction of) the inverse of a very 
large, sparse matrix obtained from a finite-difference approximation to a continuous 
differential operator. We have access to the matrix only through the "matrix-vector 
multiply" u} i—f Aw, which is computed by solving a sparse linear system. 

More precisely, we model an electrostatics problem on the square lattice network 
shown in Figure 7.1(b). The electric potential at the inner boundary nodes (red) is 
fixed at the values specified in the data vector Winner- The outer boundary (blue) is 
insulated. The electric potential of each internal node (black) is the average of the 
potentials at its four nearest neighbors. The mathematical model for this system is the 
discrete Laplace equation (using the five-point stencil) , coupled with a zero Neumann 
boundary condition on the exterior nodes and a Dirichlet boundary condition specified 
by Winner On the interior nodes. The matrix A that we seek to approximate comes 
from the linear map 

A. . Winner ' ^ Woutcr 

that describes the potential induced on the exterior nodes by the interior potential. 

For our experiments, we fix a number n and form the lattice with 4n nodes along 
the inner boundary and 12n nodes along the outer boundary. We form the discrete 
Laplace operator for the resulting donut-shaped lattice with roughly 8n^ nodes, and 
we complete each evaluation uj >—>■ Auj using the Matlab backslash operator, which 
yields a residual error below 10^^^. 

We apply Algorithm 4.2 to the operator A obtained when n ~ 133, and we 
repeat the experimental regimen outlined in §7.1. The results of a typical trial appear 
in Figure 7.6. Qualitatively, the performance matches the experiments in §7.1. 
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Fig. 7.2. Approximating a Laplace integral operator. One execution of Algorithm 4. 2 for the 
200 X 200 input matrix A described in §7. The number I of random samples varies along the 
horizontal axis; the vertical axis measures the base-10 logarithm of error magnitudes. The dashed 
vertical lines mark the points during execution at which Figure 7.3 provides additional statistics. 
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Fig. 7.3. Error statistics for approximating a Laplace integral operator. 2,000 trials of Al- 
gorithm 4.2 applied to a 200 X 200 matrix approximating the integral operator (7.1). The panels 
isolate the moments at which I = 25, 50, 75, 100 random samples have been drawn. Each solid point 
compares the estimated error fi versus the actual error ei in one trial; the open circle indicates the 
trial detailed in Figure 7.2. The dashed line identifies the minimal error cri^i, and the solid line 
marks the contour where the error estimator would equal the actual error. 
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Fig. 7.4. Approximating a Holmholtz integral operator. One execution of Algorithm 4-^ lor 
the 400 X 400 input matrix B described in §7. i. See Figure 7.2 for notations. 
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Fig. 7.5. Error statistics for approximating a Helmholtz integral operator. 2,000 trials of 
Algorithm 4-2 applied to a400x400 matrix approximation the integral operator (7.2). See Figure 7.3 
for notations. 
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Fig. 7.6. Approximating tlic inverse of a discrete Laplacian. One execution of Algorithm 4. 2 
for the 1596 X 532 input matrix A described in §7.S. See Figure 7.2 for notations. 

7.3. A large, sparse, noisy matrix arising in image processing. Our next 
example involves a matrix that arises in image processing. A recent line of work uses 
information about the local geometry of an image to develop promising new algorithms 
for standard tasks, such as denoising, inpainting, and so forth. These methods are 
based on approximating a graph Laplacian associated with the image. The dominant 
eigenvectors of this matrix provide "coordinates" that help us smooth out noisy image 
patches [120,130]. 

We begin with a 95 x 95 pixel grayscale image. The intensity of each pixel is 
represented as an integer in the range to 4, 095. We form for each pixel i a vector 
g ]g25 j-,y gathering the 25 intensities of the pixels in a 5 x 5 neighborhood 
centered at pixel i (with appropriate modifications near the edges). Next, we form 
the 9,025 x 9,025 weight matrix W that reflects the similarities between patches: 

Wij =cxp{ - ||a3**' - a;*^^ ll^/o-^}, 

where the parameter ct = 50 controls the level of sensitivity. We obtain a sparse 
weight matrix W by zeroing out all entries in W except the seven largest ones in 
each row. The object is then to construct the low frequency eigenvectors of the graph 
Laplacian matrix 

L = 1- D~^^^WD~^/^, 

where D is the diagonal matrix with entries da = Wij . These are the eigenvectors 
associated with the dominant eigenvalues of the auxiliary matrix A = D~'^/'^W D~'^/'^ . 

The matrix A is very large, and its eigenvalues decay slowly so we must use the 
power scheme summarized in Algorithm 4.3 to approximate it. Figure 7.7(lcft) illus- 
trates how the approximation error declines as the number € of samples increases. 



RANDOMIZED ALGORITHMS FOR MATRIX APPROXIMATION 



41 



When we set the exponent q — 0, which corresponds with the basic Algorithm 4.1, 
the approximation is rather poor. The graph illustrates that increasing the exponent 
q slightly results in a tremendous improvement in the accuracy of the power scheme. 

Next, we illustrate the results of using the two-stage approach to approximate 
the eigenvalues of A. In Stage A, we construct a basis for A using Algorithm 4.3 
with £ = 100 samples for different values of q. In Stage B, wc apply the Hcrmitian 
variant of Algorithm 5.1 described in §5.3 to compute an approximate eigenvalue 
decomposition. Figure 7.7(right) shows how the approximate eigenvalues compare 
with the actual eigenvalues of A. Once again, we see that the minimal exponent q = 
produces miserable results, but the largest eigenvalues are already quite accurate when 
we g = 1. 




Fig. 7.7. Approximating a graph Laplacian. For varying exponent q, one trial of the power 
scheme, Algorithm 4-3, applied to the 10,000 X 10,000 matrix A described in §7.5. (Left) Approx- 
imation errors as a function of the number I of random samples. (Right) Estimates for the 100 
largest eigenvalues given £ = 100 random samples compared with the 100 largest eigenvalues of A. 

7.4. Eigenfaces. Our next example involves a large, dense matrix derived from 
the FERET databank of face images [107, 108]. A simple method for performing face 
recognition is to identify the principal directions of the image data, which are called 
eigenfaces. Each of the original photographs can be summarized by its components 
along these principal directions. To identify the subject in a new picture, we compute 
its decomposition in this basis and use a classification technique, such as nearest 
neighbors, to select the closest image in the database [122]. 

We construct a data matrix A in the following manner. The FERET database 
contains 7, 254 photographs, and each 384 x 256 image contains 98, 304 pixels. First, 
we build an auxiliary 98, 304 x 7, 254 matrix A whose columns are the images. We 
form A by centering each column of A and scaling it to unit norm, so that the images 
are roughly comparable. The eigenfaces arc the dominant left singular vectors of this 
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matrix. 

Our goal then is to compute an approximate SVD of the matrix A. Represented 
as an array of double-precision real numbers. A would require 5.4 GB of storage, which 
does not fit within the fast memory of many machines. It is possible to compress the 
database down to at 57MB or less (in JPEG format), but then the data would have 
to be uncompressed with each sweep over the matrix. Furthermore, the matrix A has 
slowly decaying singular values, so we need to use the power scheme. Algorithm 4.3, 
to capture the range of the matrix accurately. 

To address these concerns, we implemented the power scheme to run in a pass- 
efficient manner. An additional difhculty arises because the size of the data makes it 
prohibitively expensive to calculate the actual error ei incurred by the approximation 
or to determine the minimal error cr^+i. To estimate the errors, we use the technique 
described in Remark 4.1. 

Figure 7.8 describes the behavior of the power scheme, which is similar to its 
performance for the graph Laplacian in §7.3. When the exponent q = 0, the ap- 
proximation of the data matrix is very poor, but it improves quickly as q increases. 
Likewise, the estimate for the spectrum of A appears to converge rapidly; the largest 
singular values are already quite accurate when q = \. We see essentially no improve- 
ment in the estimates after the first 3-5 passes over the matrix. 



Approximation error eg 




Estimated Singular Values cr 



100 




Fig. 7.8. Computing eigenfacos. For varying exponent q, one trial of the power scheme, 
Algorithm 4-3, applied to the 98,304 X 7,254 matrix A described in §7.4- (Left) Approximation 
errors as a function of the number I of random samples. The red line indicates the minimal errors 
as estimated by the singular values computed using t = 100 and q = Z. (Right) Estimates for the 
100 largest eigenvalues given £ = 100 random samples. 



7.5. Performance of structured random matrices. Our final set of experi- 
ments illustrates that the structured random matrices described in §4.6 lead to matrix 
approximation algorithms that are both fast and accurate. 
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First, wc compare the computational speeds of four methods for computing an 
approximation to the £ dominant terms in the SVD of an n x n Gaussian matrix. 



Method 


Stage A 


Stage B 


direct 


Rank-reveaUng QR executed using column 
pivoting and Householder reflectors 


Algorithm 5.1 


gauss 


Algorithm 4.1 with a Gaussian random matrix 


Algorithm 5.1 


srft 


Algorithm 4.1 with the modified SRFT (4.8) 


Algorithm 5.2 


svd 


Full SVD with LAPACK routine dgesdd 


Truncate to £ terms 



Applying these algorithms to approximate a Gaussian matrix represents no loss of 
generality because the size of the input matrix is the only factor relevant to the 
runtime. See Remark 7.1 for implementation details. 

Table 7.1 lists the measured runtime of a single execution of each algorithm for 
various choices of the dimension n of the input matrix and the rank £ of the approx- 
imation. Of course, the cost of the full SVD does not depend on the number £ of 
components required. 

A more informative way to look at the runtime data is to compare the relative 
cost of the algorithms. The direct method is the best deterministic approach for 
dense matrices, so we calculate the factor by which the randomized methods improve 
on this benchmark. Figure 7.9 displays the results. We make two observations: (i) 
Using an SRFT often leads to a dramatic speed-up over classical techniques, even 
for moderate problem sizes, (ii) Using a standard Gaussian test matrix typically 
leads to a moderate speed-up over classical methods, primarily because matrix-matrix 
multiplication is more efficient than rank-revealing QR. 

Second, we investigate how the choice of random test matrix infiuences the error 
in approximating an input matrix. For these experiments, we return to the 200 x 200 
matrix A defined in Section 7.1. Consider variations of Algorithm 4.1 obtained when 
the random test matrix Jl is drawn from the following four distributions: 

Gauss: The standard Gaussian distribution. 

Ortho: The uniform distribution on n x £ orthonormal matrices. 

SRFT: The SRFT distribution defined in (4.6). 

GSRFT: The modified SRFT distribution defined in (4.8). 

Intuitively, we expect that Ortho should provide the best performance. 

For each distribution, we perform 100,000 trials of the following experiment. Ap- 
ply the corresponding version of Algorithm 4.1 to the matrix A, and calculate the 
approximation error ei = \\A — QiQ'^A\\. Figure 7.10 displays the empirical proba- 
bility density function for the error ei obtained with each algorithm. We offer three 
observations: (i) The SRFT actually performs slightly better than a Gaussian ran- 
dom matrix for this example, (ii) The standard SRFT and the modified SRFT have 
essentially identical errors, (iii) There is almost no difference between the Gaussian 
random matrix and the random orthonormal matrix in the first three plots, while the 
fourth plot shows that the random orthonormal matrix performs better. This behav- 
ior occurs because, with high probability, a tall Gaussian matrix is well conditioned 
and a (nearly) square Gaussian matrix is not. 



Table 7.1 

Computational times for a partial SVD. The time, in seconds, required to compute the I 
leading components in the SVD of an nxn matrix using each of the methods from §7.5. The last 
row indicates the time needed to obtain a full SVD. 
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III 

Fig. 7.9. Acceleration factor. The relative cost of computing an l-term partial SVD of an nxn 
Gaussian matrix using direct, a benchmark classical algorithm, versus each of the three competitors 
described in §7.5. The solid red curve shouis the speedup using an SRFT test matrix, and the dotted 
blue curve shouis the speedup with a Gaussian test matrix. The dashed green curve indicates that a 
full SVD computation using classical methods is substantially slower. Table 7.1 reports the absolute 
runtimes that yield the circled data points. 

Remark 7.1. The running times reported in Table 7.1 and in Figure 7.9 depend 
strongly on both the computer hardware and the coding of the algorithms. The ex- 
periments reported here were performed on a standard office desktop with a 3.2 GHz 
Pentium IV processor and 2 GB of RAM. The algorithms were implemented in For- 
tran 90 and compiled with the Lahey compiler. The Lahey versions of BLAS and 
LAPACK were used to accelerate all matrix-matrix multiplications, as well as the 
SVD computations in Algorithms 5.1 and 5.2. We used the code for the modified 
SRFT (4.8) provided in the publicly available software package id_dist [93]. 
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Fig. 7.10. Empirical probability density functions for the error in Algorithm 4.1. As described 
in § 7. 5, the algorithm is implemented with four distributions for the random test matrix and used 
to approximate the 200 X 200 input matrix obtained by discretizing the integral operator (7.1). The 
four panels capture the empirical error distribution for each version of the algorithm at the moment 
when I = 25, 50, 75, 100 random samples have been drawn. 
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Part III: Theory 

This part of the paper, §§8-11, provides a detailed analysis of randomized sam- 
pling schemes for constructing an approximate basis for the range of a matrix, the task 
we refer to as Stage A in the framework of §1.2. More precisely, we assess the qual- 
ity of the basis Q that the proto-algorithm of §1.3 produces by establishing rigorous 
bounds for the approximation error 

- QQ*A\l , (7.3) 

where |||'||| denotes either the spectral norm or the Frobenius norm. The difficulty in 
developing these bounds is that the matrix Q is random, and its distribution is a 
complicated nonlinear function of the input matrix A and the random test matrix ft. 
Naturally, any estimate for the approximation error must depend on the properties 
of the input matrix and the distribution of the test matrix. 

To address these challenges, we split the argument into two pieces. The first part 
exploits techniques from linear algebra to deliver a generic bound for the error that 
depends on the interaction between the test matrix H and the right singular vectors 
of the input matrix A, as well as the tail singular values of A. In the second part of 
the argument, we take into account the distribution of the random matrix to estimate 
the error for specific instantiations of the proto-algorithm. This bipartite proof is 
common in the literature on randomized linear algebra, but our argument is most 
similar in spirit to [17]. 

Section 8 surveys the basic linear algebraic tools we need. Section 9 uses these 
methods to derive a generic error bound. Afterward, we specialize this results to 
case where the test matrix is Gaussian (§10) and the case where the test matrix is a 
subsampled random Fourier transform (§11). 

8. Theoretical preUminaries. We proceed with some additional background 
from linear algebra. Section 8.1 sets out properties of positive-semidefinite matrices, 
and §8.2 offers some results for orthogonal projectors. Standard references for this 
material include [11,72]. 

8.1. Positive semidefinite matrices. An Hcrmitian matrix M is positive 
semidefinite (briefly, psd) when u*Mu > for all u ^ 0. If the inequalities are 
strict, M is positive definite (briefly, pd). The psd matrices form a convex cone, 
which induces a partial ordering on the linear space of Hcrmitian matrices: M ^ N 
if and only if TV — M is psd. This ordering allows us to write M :>= to indicate that 
the matrix M is psd. 

Alternatively, we can define a psd (resp., pd) matrix as an Hcrmitian matrix with 
nonnegative (resp., positive) eigenvalues. In particular, each psd matrix is diagonal- 
izable, and the inverse of a pd matrix is also pd. The spectral norm of a psd matrix 
M has the variational characterization 

u*Mu , , 

LM =max , 8.1) 

" " u*u ^ ' 

according to the Rayleigh-Ritz theorem [72, Thm. 4.2.2]. It follows that 

M^N =^ ||M||<||Ar||. (8.2) 



A fundamental fact is that conjugation preserves the psd property. 
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Lemma 8.1 (Conjugation Rule). Suppose that M )^ 0. For every A, the matrix 
"MA )p 0. In particular, 

M 4N =^ A*MA 4 A*NA. 



Our argument invokes the conjugation rule repeatedly. As a first application, we 
establish a perturbation bound for the matrix inverse near the identity matrix. 

Proposition 8.2 (Perturbation of Inverses). Suppose that M Then 

I-(I + M)-i 4M 



Proof. Define R = M^/^, the psd square root of M promised by [72, Thm. 7.2.6]. 
We have the chain of relations 

I - (I + R^)-^ = (I + R^y^R^ = R{I + R^y^R 4 R^ 

The first equality can be verified algebraically. The second holds because rational 
functions of a diagonalizable matrix, such as i2, commute. The last relation follows 
from the conjugation rule because (I + R?)~^ ^ I. □ 

Next, we present a generalization of the fact that the spectral norm of a psd 
matrix is controlled by its trace. 

Proposition 8.3. We have \\M\\ < \\A\\ + \\C\\ for each partitioned psd matrix 



Proof. The variational characterization (8.1) of the spectral norm implies that 



||M|| = sup 

+ = 1 



X 




A 


B 




X 


y. 




B* 


C 




y. 



< sup (||A||||a;f + 2i|B||l|a;||||y|| + !|C||||y||'). 
\M"- + \\y\\^=i 

The block generalization of Hadamard's psd criterion [72, Thm. 7.7.7] states that 

< ||A|| ||C||. Thus, 

||M|| < sup {\\A\\'/' \\x\\ + ||C||i/2 llylD' = + ||C|| . 

This point completes the argument. □ 

8.2. Orthogonal projectors. An orthogonal projector is an Hermitian matrix 
P that satisfies the polynomial = P. This identity implies =^ P ^ I. An 
orthogonal projector is completely determined by its range. For a given matrix M, 
we write Pm for the unique orthogonal projector with range(PM) = range(iW). 
When M has full column rank, we can express this projector explicitly: 

Pm ^ M{M*M)-^M*. (8.3) 
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The orthogonal projector onto the complementary subspace. range(P)^, is the matrix 
I P. Our argument hinges on several other facts about orthogonal projectors. 

Proposition 8.4. Suppose U is unitary. Then U*PmU = Pvm- 

Proof. Abbreviate P = U*PmU. It is clear that P is an orthogonal projector 
since it is Hermitian and P^ = P. Evidently, 

range(P) = U* rangc(M) rangc(C/*M). 

Since the range determines the orthogonal projector, we conclude P ~ Pu*m- D 

PROPOSITION 8.5. Suppose range(A/") C range(iVf). Then, for each matrix A, it 
holds that \\PnA\\ < \\PmA\\ and that - Pm)A\\ < || (I - Pjv)A|| . 

Proof. The projector Pjv =^ I, so the conjugation rule yields PmPnPm ^ Pm- 
The hypothesis rangc(iV) C rangc(7Vi') implies that PmPn = -Pjv, which results in 

PmPnPm = PnPm = [PmPnT = Pn- 

In summary, Pn ^ Pm- The conjugation rule shows that A*PmA =^ A*PmA. We 
conclude from (8.2) that 

ilPjvAll' = \\A*PnA\\ < \\A*PmA\\ = ||Pa^A||' . 

The second statement follows from the first by taking orthogonal complements. □ 

Finally, we need an unusual variation on the spectral radius formula. 

Proposition 8.6. Let P be an orthogonal projector, and let M be a matrix. For 
each nonnegative q, 

\\PM\\ < ||P(MM*)'M||^/^^'^+^^ . (8.4) 



Proof. Suppose that R is an orthogonal projector, D is a nonnegative diagonal 
matrix, and t > 1. We claim that 

\\RDR\\' < ||Pi:>*H|| . (8.5) 

Granted this inequality, we quickly complete the proof. Using an SVD M — VEV* , 
we compute 

||p^||2(2g+l) ^ \\pMM*Pf'^+^ = \\{U*PU) ■ S2 . {U* PU)\\^''^^ 

< ||(J7*Pi7) • S^'^-j+i) . ([/*p(7)|| ^ ||P(MM*)2(2<?+i)p|| 

= ||P(MM*)«M • M*{MM*yP\\ = \\P{MM*)iMf . 

We have used the unitary invariance of the spectral norm in the second and fourth 
relation. The inequality (8.5) applies because U*PU is an orthogonal projector. Take 
a square root to finish the argument. 

Now, we turn to the claim (8.5). This relation follows immediately from [11, 
Thm. IX. 2. 10], but we offer a direct argument based on more elementary considera- 
tions. Let a; be a unit vector at which 

x*{RDR)x = \\RDR\\ . 
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Wc must have Rx = x. Otherwise, ||i2a;|| < 1 because R is an orthogonal projector, 
which implies that the unit vector y = Rx/ \\Rx\\ verifies 

{Rx)*iRDR){Rx) x*(RDR)x 
y*{RDR)y = ' \^ ^^'^ = — ^ > x*{RDR)x. 



\Rx\\ 



\Rx\ 



Writing Xj for the entries of x and dj for the diagonal entries of £), wc find that 



\RDRY = [x*{RDR)xY = [x*DxY = '^[^ djx) 



< 



x*D*x = {Rx)*D\Rx) < ||-R-D*i2|| 



The inequality is Jensen's, which applies because '^x'j = 1 and the function z \z\ 
is convex for i > 1. □ 

9. Error bounds via linear algebra. We are now prepared to develop a de- 
terministic error analysis for the proto-algorithm described in §1.3. To begin, we 
must introduce some notation. Afterward, we establish the key error bound, which 
strengthens a result from the literature [17, Lem. 2]. Finally, we explain why the 
power method can be used to improve the performance of the proto-algorithm. 

9.1. Setup. Let A be an m x n matrix with singular value decomposition 
A = UHV*, as described in Section 3.2.2. The algorithms described construct ap- 
proximations to the space spanned by the first k left singular vectors, where k for 
now is a given number. To analyze the situation, we first partition the singular value 
decomposition of A as follows: 



k n — k 



U 



' Si 






^2 . 







k 

n ~ k 



(9.1) 



Let il be an n X ^ test matrix, where we assume only that £ > k. Decompose the 
test matrix in the coordinate system determined by the right unitary factor of A: 



r2i = V{ n and fl2 ^ V2 O. 



(9.2) 



The singular spectra of Hi and CI2 play a critical role in the analysis of the algorithm. 
Indeed, we ultimately set £ — k -\- p because we need to improve the conditioning of 
the block Jli. With this notation, the sample matrix Y can be expressed as 



ACt = U 



e 

SiOi 



k 

n — k 



(9.3) 



Intuitively, the top block reflects the gross behavior of A, while the second block 
represents a perturbation. 

9.2. A deterministic error bound for the proto-algorithm. The proto- 
algorithm constructs an orthonormal basis Q for the range of the sample matrix Y, 
and our goal is to quantify how well this basis captures the action of the input A. 
Since QQ* = Py, the challenge is to obtain boimds on the approximation error 



II A - QQ*A\l 



fI-Pv)A||| 
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The following theorem shows that the behavior of the proto-algorithm depends on 
the interaction between the test matrix and the right singular vectors of the input 
matrix, as well as the singular spectrum of the input matrix. 

Theorem 9.1 (Deterministic error bound). Let A be an m x n matrix with 
singular value decomposition A = UHV* , and fix k > 0. Choose a test matrix fl, 
and construct the sample matrix Y ~ A^l. Partition S as specified in (9.1), and 
define fli and 02 via (9.2). Assuming that Oi has full row rank, the approximation 
error satisfies 

1(1 - PY)Af < ISaf + IllS^OaOllf , (9.4) 

where |||'||| denotes either the spectral norm or the Frobenius norm. 

Theorem 9.1 sharpens [17, Lem. 2] by means of a different argument. Our proof 
strategy is inspired by the perturbation theory of orthogonal projectors [124]. 

Proof. We establish the bound for the spectral-norm error. The bound for the 
Frobenius-norm error follows from an analogous argument that is slightly easier. 

As a preliminary step, we check that U plays no essential role in the argument. 
In effect, we execute the proof for the auxiliary matrix 

Ao = U*A = SF*. 

Owing to the unitary invariance of the spectral norm and to Proposition 8.4, we have 
the identity 

\\{\-Py)A\\ = \\U*{1-Py)UA4 = \\{1-Pu'y)A4. (9.5) 

In words, we need to determine how well the range of U*Y explains Aq. 

The key idea is to identify a matrix with full column rank whose range lies within 
the range of U*Y . This step allows us to invoke (8.3), which yields an analytic 
expression for the orthogonal projector. In consideration of (9.3), set 

z = ((7*i")o|Sj;\ 

where we may assume that Si is invertible by a perturbation argument. This con- 
struction ensures that range(.Z) C T&ngc{U*Y). Proposition 8.5 implies 

\\{1-Pu'y)A4 < ||(I-Pz)Ao||. 

That is, we can better approximate the matrix Aq in the range of U*Y than in the 
range of Z . In view of (9.5), it holds that 

11(1 - PY)Af < 11(1 - Pz)A4^ = \\Al{\ - Pz)A4 = ||S*(I - Pz)n\ : (9.6) 

where the last identity follows from the unitary invariance of the spectral norm. Our 
goal now is to bound the right-hand side of (9.6). 

To continue, we need a detailed expression for the projector I — Pz- Referring 
back to (9.3) and invoking the assumption that fJi has full row rank, we obtain 



SiOi 

S2O2 
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where F = 'Tl2^2^\^i^ a-nd I represents a conformal identity matrix, here with 
dimension k x k. In particular, this expression shows that Z has fuU column rank, so 
we can apply (8.3) to see that the projector 

Pz = Z{Z*Z)-^Z*. 

Expanding this formula, we determine that the complementary projector satisfies 



I Pz = I 



(I + F*F)" 



'l-(l + F*F)-^ -{1 + F*F)-^F* 
-F{1 + F*F)-^ \- F{\ + F*F)-^F* 



(9.7) 



The partitioning here is conformal with the partitioning of S. As a result, when we 
conjugate the matrix by S, copies of Sj"^, presently hidden in the top-left block, will 
cancel to happy effect. 

The latter point may not seem obvious, owing to the complicated form of (9.7). 
In reality, the block matrix is less fearsome than it looks. Proposition 8.2, on the 
perturbation of inverses, shows that the top-left block verifies 

I - {1 + F*F)-^ ^ F*F. 

The bottom-right block satisfies 

I- F{I + F*F)-'^F* ^ I, 

because the conjugation rule guarantees that F(I + F* F)^^ F* )>= 0. We abbreviate 
the off-diagonal blocks with the symbol B = —{1 + F* F)^^ F* . In summary, 



F*F B 
B* I 



This relation exposes the key structural properties of the projector. 

Moving toward the final estimate, we conjugate the last inequality by S to obtain 



The conjugation rule demonstrates that the matrix on the left-hand side is psd, so 
the matrix on the right-hand side is too. Proposition 8.3 results in the norm bound 

||S*(I-Pz)S|| < ||SIP*FSi|| + ||S*S2|| = ||PSi||' + llSaf . 

t — 1 

Recall that F = S2r22i^i5]j^ , so the factor Si cancels neatly. Therefore, 



|S*(I-Pz)S|| < ||S2fl2f^l| 



Looking back to (9.6), we discover that the proof is complete. □ 
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9.3. Analysis of the power scheme. Theorem 9.1 suggests that the perfor- 
mance of the proto-algorithm depends strongly on the relationship between the large 
singular values of A listed in Si and the small singular values listed in S2. When a 
substantial proportion of the mass of A appears in the small singular values, the con- 
structed basis Q may have low accuracy. Conversely, when the large singular values 
dominate, it is much easier to identify a good low-rank basis. 

To improve the performance of the proto-algorithm. we can run it with a closely 
related input matrix whose singular values decay more rapidly [67, 111]. Fix a non- 
negative integer g, and set 

B = {AA*yA ^ 

We apply the proto-algorithm to B, which generates a sample matrix Z = Bfl and 
constructs a basis Q for the range of Z. Section 4.5 elaborates on the implementation 
details. The following result describes how well we can approximate the original 
matrix A within the range of Z . 

Theorem 9.2 (Power scheme). Let A be an mxn matrix, and let be an nx i 
matrix. Fix a nonnegative integer q, form B = [A* AY A, and compute the sample 
matrix Z = JBr2. Then 

\\{l-Pz)A\\ < ||(I-Pz)B||i/(2«+i'. 



Proof. We can estimate that 

||(I~Pz)A|| < ||(I-Pz)(AA*)'A||'/^'''+'' = ||(I-Pz)B||'/'''+'' 

as a simple consequence of Proposition 8.6. □ 

Let us illustrate how the power scheme interacts with the main error bound (9.4). 
Fix a matrix A, and let (Tfc+i denote its (fc -f l)th singular value. First, suppose we 
approximate A in the range of the sample matrix Y = AO. Since IIS2II = crfc+i, 
Theorem 9.1 implies that 

\\{\-Py)A\\ < (l + ||020l||y^''afe+i. (9.8) 

Now, define B = {AA*YA, and suppose that we approximate A within the range of 
the sample matrix Z ~ Bft. Together, Theorem 9.2 and Theorem 9.1 imply that 

||(i-Pz)A|| < ||(i-Pz)B||i/(^^+^) < [i + \\n.M\\Y^''^'^-k+i 

because cr^^i is the (fc + l)th singular value of B. In effect, the power scheme drives 
down the suboptimality of the bound (9.8) exponentially fast as the power q increases. 
In principle, we can make the extra factor as close to one as we like, although this 
increases the cost of the algorithm. 

Remark 9.1. When we apply the power scheme numerically, we obtain addi- 
tional Krylov information that can be exploited to improve the numerical accuracy 
of the orthogonalization step. The resulting procedure appears as Algorithm 4.3. 
Disregarding numerical effects, we can develop an error bound for this scheme as an 
immediate corollary of Theorem 9.2. For a rounding-error analysis, sec [111]. 
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Corollary 9.3 (Extended power scheme). Instate the hypotheses and notation 
of Theorem 9.2. Define the matrices l^*-*' = [AA*yAQ, for i = 0,1,..., and 

construct the extended sample matrix 

= . (9.9) 

Then 

\\{l-P^)A\\<\\{l^Pz)Bf^^'^+'\ 



Proof. Observe that range(Z) C range(Ty). Then invoke Proposition 8.5 and 
Theorem 9.2. □ 

10. Gaussian test matrices. The error bound in Theorem 9.1 shows that 
the performance of the proto-algorithm depends on the interaction between the test 
matrix $7 and the right singular vectors of the input matrix A. Algorithm 4.1 is a 
particularly simple version of the proto-algorithm that draws the test matrix according 
to the standard Gaussian distribution. The literature contains a wealth of information 
about these matrices, which results in a sharp error analysis. 

We focus on the real case in this section. Analogous results hold in the complex 
case, where the algorithm even exhibits superior performance. 

10.1. Technical background. A standard Gaussian matrix is a random matrix 
whose entries are independent standard normal variables. The distribution of a stan- 
dard Gaussian matrix is rotationally invariant: If U and V are orthogonal matrices, 
then U*GV also has the standard Gaussian distribution. 

Our analysis requires detailed information about the properties of Gaussian ma- 
trices. In particular, we must understand how the norm of a Gaussian matrix and its 
pseudoinverse vary. We summarize the relevant results and citations here, reserving 
the details for Appendix A. 

Proposition 10.1 (Expected norm of a scaled Gaussian matrix). Fix matrices 
S^T, and draw a standard Gaussian matrix G. Then 

(e||5GT*!||)'^' = ||5||p||T||p (10.1) 
E \\SGT*\\ < \\S\\ \\T\\p + \\S\\p \\T\\ . (10.2) 



The identity (10.1) follows from a direct calculation. The second bound (10.2) 
relies on methods developed by Gordon [62,63]. See Propositions A.l and A. 2. 

Proposition 10.2 (Expected norm of a pseudo-inverted Gaussian matrix) . Draw 
a k X (k + p) standard Gaussian matrix G with p > 2. Then 



1/2 



E||Gt||<2^^. (10.4) 
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The first identity is a standard result from multivariate statisties [100, p. 96]. The 
second follows from work of Chen and Dongarra [28]. See Proposition A. 4 and A. 5. 

To study the probability that Algorithm 4.1 produces a large error, we rely on 
tail bounds for functions of Gaussian matrices. The next proposition rephrases a 
well-known result on concentration of measure [14, Thm. 4.5.7]. See also [85, §1.1] 
and [84, §5.1]. 

Proposition 10.3 (Concentration for functions of a Gaussian matrix). Suppose 
that h is a Lipschitz function on matrices: 

\h{X)-h{Y)\<L\\X-Y\\p forallX,Y. 

Draw a standard Gaussian matrix G. Then 

¥{h{G) > Eh{G) + Lt} < e-*'/2. 



Finally, we state some large deviation bounds for the norm of a pseudo-inverted 
Gaussian matrix. 

PROPOSITION 10.4 (Norm bounds for a pseudo-inverted Gaussian matrix). Let 
G be a k X {k + p) Gaussian matrix where p > 4. For all t > I, 



\G''\L> J— -t} <4t-P, and (10.5) 



|Gt||>^^^.4<t-(^+i). (10.6) 
I II p+l 



Compare these estimates with Proposition 10.2. It seems that (10.5) is new; we 
were unable to find a comparable analysis in the random matrix literature. Although 
the form of (10.5) is not optimal, it allows us to produce more transparent results than 
a fully detailed estimate. The bound (10.6) essentially appears in the work of Chen 
and Dongarra [28]. See Propositions A. 3 and Theorem A. 6 for more information. 

10.2. Average-case analysis of Algorithm 4.1. We separate our analysis 
into two pieces. First, we present information about expected values. Afterward, we 
provide bounds on the likelihood of a large deviation. 

We begin with the simplest result, which provides an estimate for the expected 
approximation error in the Frobenius norm. All proofs are postponed to the end of 
the section. 

Theorem 10.5 (Average Frobenius error). Suppose that A is a real m x n 
matrix with singular values ai > (72 > cs > • • • • Choose a target rank k and an 
oversampling parameter p > 2, where k + p < min{TO, ??}. Draw an n x {k + p) 
standard Gaussian matrix Jl, and construct the sample matrix Y = A^l. Then the 
expected approximation error 

E|l(I-iV)-4||,<(l + -^)"'(^^^.|)"'. 
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This theorem encapsulates several distinct — and intriguing — behaviors of Algo- 
rithm 4.1. The Eckart- Young theorem [53] shows that (X]j>fc "'j )^^^ minimal 
Frobenius-norm error when approximating A with a rank-A: matrix. This quantity is 
the appropriate benchmark for the performance of the algorithm. If the small singular 
values of A are very flat, the series may be as large as ak+i \/ min{?Ti, n} — k. On the 
other hand, when the singular values exhibit some decay, the error may be on the 
same order as (Tfc+i. 

The error bound always exceeds this baseline error, but it may be polynomially 
larger, depending on the ratio between the target rank fc and the oversampling pa- 
rameter "p. For p small (say, less than five) , the error is actually quite variable because 
the small singular values of a nearly square Gaussian matrix are very unstable. As 
the oversampling increases, the performance improves quickly. When p ~ fc, the error 
is already within a constant factor of the baseline. 

The error bound for the spectral norm is somewhat more complicated, but it 
reveals some interesting new features. 

Theorem 10.6 (Average spectral error). Under the hypotheses of Theorem 10.5, 




E 11(1 - P^)A\\ < I 1 + J -_ I + (V _ 



Mirsky [99] has shown that the quantity ak+i is the minimum spectral-norm error 
when approximating A with a rank-fc matrix, so the first term in Theorem 10.6 is 
analogous with the error bound in Theorem 10.5. The second term represents a new 
phenomenon: we also pay for the Frobenius-norm error in approximating A. Note 
that, as the amount p of oversampling increases, the polynomial factor in the second 
term declines much more quickly than the factor in the first term. Consider what 
happens when p ^ k. 

Finally, we remark that the bound in Theorem 10.6 implies 



1(1 -Py) All < 



; ey/k+p ^— ^ r 

1 + W H • WmuUmjfil — k 

y p-1 p 



so the average spectral-norm error always lies within a small polynomial factor of the 
baseline ak+i- 

We continue with the proofs of these results. 

Proof. [Theorem 10.5] Let V be the right unitary factor of A. Partition V = 
[Vi I V2] into blocks containing, respectively, k and n — k columns. Recall that 

fti = V*n and O2 = ^2*^^- 

The Gaussian distribution is rotationally invariant, so V*n is also a standard Gaus- 
sian matrix. Observe that Hi and fl2 are disjoint submatrices of V*r2, so these two 
matrices are not only standard Gaussian but also stochastically independent. Fur- 
thermore, the rows of a (fat) Gaussian matrix are almost surely in general position, 
so the fc X (k + p) matrix fli has full row rank with probability one. 
Holder's inequality and Theorem 9.1 together imply that 

E\\{I- Py)A\\p < (e\\{I-Py)A\\IV^^ < (IIS2II' +E||S2fl2f^I||p)'^'. 
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Wc compute this expectation by conditioning on the value of fti and applying Propo- 
sition 10.1 to the scaled Gaussian matrix r22. Thus, 

E||i;2f^2fll||p =e(e [||S2J^2f^I||p I f^ll) =E(i|I]2||F||f2l||p 



= llS2||^E||^2^||p = -^.llS2||p, 

where the last expectation follows from relation (10.3) of Proposition 10.2. In sum- 
mary, 

E||(I-P^)A||p< (^l + -A^y^'i|S2||p. 

Observe that |lS2||p = J2j>k t° complete the proof. □ 

Proof. [Theorem 10.6] The argument is similar the proof of Theorem 10.5. First, 
Theorem 9.1 and the (deterministic) Holder inequality imply that 

1 /2 

E\\{1-Py)A\\ <E(||S2f + ||S2rJ2nlf) < IIS2II +E||S2f220l||. 

We condition on Jli and apply Proposition 10.1 to bound the expectation with respect 
to Jla- Thus, 



|2 



£1122^2^^111 < E (IIS2II ||f^I||p + IIS2IIP II^Ill) 
(E||f2t||p) +IIS2IIP 



< IIS2II [E\\n[\\^) +||S2||p-E||f2l||. 

where the second relation requires Holder's inequality. Applying both parts of Propo- 
sition 10.2, we obtain 



e||S2mI|| < J-^!|S2|| + ^^^^||S2!lp. 

Note that IIS2II = o-k+i to wrap up. □ 

10.3. Probabilistic error bounds for Algorithm 4.1. We can develop tail 
bounds for the approximation error, which demonstrate that the average performance 
of the algorithm is representative of the actual performance. We begin with the 
Frobenius norm because the result is somewhat simpler. 

Theorem 10.7 (Deviation bounds for the Frobenius error) . Frame the hypotheses 
of Theorem 10.5. Assume further that p > 4. For all u,t> 1, 



I (I - Pv)Ai|p < (1 + 1 . v/m7^) {Y.^^, + ■ • Tk+i, 



except with probability 5t^P + 2e^" ' . 



To parse this theorem, observe that the first term in the error bound corresponds 
with the expected approximation error in Theorem 10.5. The second term represents 
a deviation above the mean. In general, the mean dominates the size of a deviation. 



58 



HALKO, MARTINSSON, AND TROPP 



but the deviations can be substantially smaller when the oversampling parameter p 
is large or the singular values of the input matrix decay slowly. 
An analogous result holds for the spectral norm. 

Theorem 10.8 (Deviation boimds for the spectral error). Frame the hypotheses 
of Theorem 10.5. Assume further that p > 4. For all u,t> \, 



Kl -Py)A\\ 

except with probability 5t^P + e"" 



cy/k+p 

+ Ut — — CTfc+l, 

p + I 



The bracket corresponds with the expected spectral-norm error while the remain- 
ing term represents a deviation above the mean. Neither the numerical constants nor 
the precise form of the bound are optimal because of the slackness in Proposition 10.4. 
Nevertheless, the theorem gives a fairly good picture of what is actually happening. 

We acknowledge that the current form of Theorem 10.8 is complicated. To pro- 
duce more transparent results, we make appropriate selections for the parameters u, t 
and compute the numerical constants. 

Corollary 10.9 (Simplified deviation bounds for the spectral error). Frame 
the hypotheses of Theorem 10.5, and assume further that p > 4. Then 



1(1 - Py)A\\ < (l + 17v/TTI7^) a.+i + rV a?^'^' 



p + 1 \^3>k ^ 

except with probability 6e~^. Moreover, 

\\{1-Py)A\\ < (l + 8v/(fc+p)-plogp) ^k+i+-iVk+^{Y.j>k^'i) 



1/2 



except with probability 6p ^ . 

Proof. The first part of the result follows from the choices t = e and u = ^/2p, 
and the second emerges when t = p and u = ^/2plogp. Another interesting parameter 
selection is t = p'^^^ and u = \/2c logp, which yields a failure probability 6p~'^. □ 

Corollary 10.9 should be compared with [94, Obs. 4.4-4.5]. Although our result 
contains sharper error estimates, the failure probabilities are usually worse. 

We continue with a proof of Theorem 10.8. The same argument can be used to 
obtain a bound for the Frobenius-norm error, but we omit a detailed account. 

Proof. [Theorem 10.8] Since fJi and fl2 are independent from each other, we can 
study how the error depends on the matrix fl2 by conditioning on the event that Hi 
is not too irregular. To that end, we define a (parameterized) event on which the 
spectral and Frobenius norms of the matrix fl\ are both controlled. For t > 1, let 



Et = \n,: \\n\\\ <^^lE±P.t and 
p+1 

Invoking both parts of Proposition 10.4, we find that 




RANDOMIZED ALGORITHMS FOR MATRIX APPROXIMATION 



59 



Consider the function h{X) = ||S2Xr2|||. We quickly compute its Lipschitz 
constant L with the lower triangle inequality and some standard norm estimates: 

\h{x)-hiY)\ < ||S2(x-r)nl|| 

<||S2|| \\X-Y\\\\n\\\ <||S2||||f2l|| ||X-r||p. 
Therefore, L < IIS2II ||^i||- Relation (10.2) of Proposition 10.1 implies that 

E[h{n2) I ni] < 11S2II + w^^h 

Applying the concentration of measure inequality, Proposition 10.3, conditionally to 
the random variable h{fl2) ~ \\'S2^2^i\\p results in 

p{||S2f^2f^I||F > 11^211 ||flI||p + ||S2|lF||f^l|| +IIS2II -U I St} <e-"'/2. 

Under the event Et. we have explicit bounds on the norms of Jlj, so 



P ||S2J22JlI IIf > |1S2!1 \ —-t+ IIS2IIP • < + "-^^ ■ ut E, 

y p p + i p + i 

< e-"'/2. 

Use the fact P {ED < 5t~^ to remove the conditioning. Therefore, 



' p + 1 p + l 

< 5rP + e 



Insert the expressions for the norms of S2 into this result to complete the probability 
bound. Finally, introduce this estimate into the error bound from Theorem 9.1. □ 

10.4. Analysis of the power scheme. Theorem 10.6 makes it clear that the 
performance of the randomized approximation scheme. Algorithm 4.1, depends heav- 
ily on the singular spectrum of the input matrix. The power scheme outlined in 
Algorithm 4.3 addresses this problem by enhancing the decay of spectrum. We can 
combine our analysis of Algorithm 4.1 with Theorem 9.2 and Corollary 9.3 to obtain 
a detailed report on the behavior of the performance of the power scheme using a 
Gaussian matrix. 

Corollary 10.10 (Average spectral error for the power scheme). Frame the 
hypotheses of Theorem 10.5. Define B ~ {AA*Y A for a nonnegative integer q, and 
construct the sample matrix Z = Bft. Then 



E|l(I-P^)Ai| < 



1/(29+1) 

2q+l , C^fc + p ^2(29+1) 



p-l)"''+^ ^ p \^j>k^^ 



The same bound holds when we replace Z by the extended sample matrix W from (9.9). 
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Proof. By Holder's inequality and Theorem 9.2, 

E\\il~ Pz)A\\ < (E\\{l-Pz)Af'+'y^^^''^'^ < (E||(I-Pz)B||)i/(2«+i'. 

Invoke Theorem 10.6 to bound the right-hand side, noting that crj{B) = '^'j'^^^- The 
result for the extended sample matrix follows if we use Corollary 9.3 in place of 
Theorem 9.2. □ 

The real message of Corollary 10.10 emerges if we bound the series using its 
largest term a^'^j^ and draw the factor ak+i out of the bracket: 



E\\{I-Pz)A\\ < 

In words, as we increase the exponent q, the power scheme drives the extra factor in 
the error to one exponentially fast. By the time q ^ log (min{m, n}), 

E\\{l-Pz)A\\^ak+i, 

which is the baseline for the spectral norm. 

To obtain large deviation bounds for the performance of the power scheme, sim- 
ply combine Theorem 9.2 or Corollary 9.3 with Theorem 10.8. We omit a detailed 
statement. 

Remark 10.1. We lack an analogous theory for the Frobenius norm because 
Theorem 9.2 depends on Proposition 8.6, which is not true for the Frobenius norm. 

11. SRFT test matrices. Another way to implement the proto-algorithm from 
§1.3 is to use a structured random matrix so that the matrix product in Step 2 can 
be performed quickly. One type of structured random matrix that has been proposed 
in the literature is the subsampled random Fourier transform, or SRFT, which we 
discussed in §4.6. In this section, we present bounds on the performance of the proto- 
algorithm when it is implemented with an SRFT test matrix. In contrast with the 
results for Gaussian test matrices, the results in this section hold for both real and 
complex input matrices. Unfortunately, our results for the SRFT are also substantially 
less refined. 

11.1. Construction and Properties. Recall from §4.6 that an SRFT is a tall 
n X ( matrix of the form fl = \/n/i ■ DFR* where 

• D is a random n x n diagonal matrix whose entries are independent and 
uniformly distributed on the complex unit circle; 

• is the n X n unitary discrete Fourier transform; and 

• i? is a random £ x n matrix that restricts an ri-dimensional vector to £ coor- 
dinates, chosen uniformly at random. 

Up to scaling, an SRFT is just a section of a unitary matrix, so it satisfies the norm 
identity ||0|| = ^/nji. 

This design may seem mysterious, but there are clear intuitions to support it. 
Suppose that we want to estimate the energy (i.e., squared £2 norm) of a fixed vector 
X by sampling £ of its n entries at random. On average, these random entries carry 
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tjn of the total energy. (The factor \fnji reverses this scahng.) When x has a 
few large components, the variance of this estimator is very high. On the other hand, 
when the components of x have comparable magnitude, the estimator has much lower 
variance, so it is precise with very high probability. 

The purpose of the matrix product DF is to flatten out input vectors before we 
sample. To see why it achieves this goal, fix a unit vector a;, and examine the first 
component of x*DF. 



En 
i—1 



where fij are the components of the DFT matrix F. This random sum clearly has zero 
mean. Since the entries of the DFT have magnitude the variance of the sum 

is n^^. Hoeffding's inequality [71] shows that the magnitude of the first component 
of x*DF is on the order of with extremely high probability. The remaining 

components have exactly the same property. 

In fact, an appropriately designed SRFT approximately preserves the geometry 
of an entire subspace of vectors. We present a weaker result that is adequate for our 
purposes. 

Theorem 11.1 (The SRFT preserves rank). Fix anx k orthonormal matrix W 
with k > 246, and draw an n x £ SRFT matrix fl where the parameter I satisfies 

2 



24: Vk+ v/81og(25n) log(4fc) < £ < n. 
Then 

with probability at least 1/2. 

In words, the kernel of an SRFT of dimension £ ^ k log k is unlikely to intersect 
a fixed fc-dimensional subspace. In contrast with the Gaussian case, the logarithmic 
factor logfc in the lower bound on £ cannot generally be removed (Remark B.2). 
Although this theorem only provides a constant failure probability, the failure rate in 
practice is polynomial in k^^. 

The proof of Theorem 11.1 appears in Appendix B. The argument closely fol- 
lows [116, Thm. 3.1] and [132, Sec. 9]. Some related results appear in [103]. 

Remark 11.1. For large problems, we have been able to reduce the numerical 
constants further. If 1 ^ log(n) <C k, then sampling 

£> 9fclog(4fc) 

coordinates is sufficient to ensure that <Tk{Wn) > 1/VT8 with probability 0{k~^^^'^). 
Although we have taken some pains to calculate explicit and reasonable constants, 
we must emphasize that no significance attaches to the precise values stated here. 
This prejudice toward constants is the reason for the weak bounds on the failure 
probability. 



Remark 11.2. This discussion suggests that the precise form of the SRFT is 
not particularly important. Indeed, we can replace F by any unitary matrix whose 
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entries are uniformly small and which is equipped with a fast matrix-vector multiply. 
Likewise, we can choose other distributions for the diagonal entries of D, such as 
random signs. 



11.2. Performance guarantees. We are now prepared to present detailed in- 
formation on the performance of the proto-algorithm when the test matrix H is an 
SRFT. 

Theorem 11.2 (Error bounds for SRFT). Fix an m x n matrix A with singular 
values (Ti > (72 > (73 > . . . . Fix a number k > 246, and draw an n x £ SRFT matrix 
Jl, where 

24: Vk+ V81og(25n) ^ log(4fc) < £ < n. 
Construct the sample matrix Y = Afl. Then 



11(1 -Py) Ail < Vl + 37i/£ • CTfc+i and 



1/2 



with probability at least 1/2. 



As we saw in §10.2, the quantity ak+i is the minimal spectral-norm error possible 
when approximating A with a rank-Zc matrix. Similarly, the series in the second bound 
is the minimal Frobenius-norm error when approximating A with a rank-fc matrix. 
We see that both error bounds lie within a polynomial factor of the baseline, and this 
factor decreases with the number £ of samples we retain. 

In practice, £ = klogk is a reasonable choice for the sampling parameter. Al- 
though the factor logfc cannot generally be removed, experiments suggest that the 
selection ^ = /c + 20 is often adequate. 

The likelihood of error with an SRFT test matrix is much higher than in the 
Gaussian case. In practice, the failure probability here is polynomial in k~^, while in 
the Gaussian failure probability is e"*^^"*'-' or better. This difference is not an artifact 
of the analysis. Discrete sampling techniques inherently fail with higher probability 
because of coupon collection issues (Remark B.2). 

We finish the section with the proof of Theorem 11.2. 

Proof. [Theorem 11.2] Let V be the right unitary factor of matrix A, and partition 
V = [Vi I V2] into blocks containing, respectively, k and n~ k columns. Recall that 

J7i = V*n and Q.2 = V^^. 

where $7 is the conjugate transpose of an SRFT. Theorem 11.1 ensures that the 
submatrix Vti has full row rank with probability at least 1/2. Therefore, Theorem 9.1 
implies that 



il~Px)A\\ < IISo 



l+llf^lll ■\\n2f 



1/2 



where |||'||| denotes either the spectral norm or the Frobenius norm. Our application 

i spectral nc 

< Vs. 



of Theorem 11.1 also ensures that the spectral norm of ft\ is under control: 
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We may bound the spectral norm of 0,2 dctcrministically. 

Iiriall = 11^2*^^11 < ll'^2*ll ll^^ll = VW^ 

since V2 and \/(/n ■ ft arc both orthonormal matrices. Combine these estimates to 
complete the proof. □ 
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Appendix A. On Gaussian matrices. This section collects some of the prop- 
erties of Gaussian matrices that we use in our analysis. Most of the results follow 
quickly from material that is already available in the literature. In one case, however, 
we require a surprisingly difficult new argument. We focus on the real case here; the 
complex case is similar but actually yields better results. 

A.l. Expectation of norms. We begin with the expected Frobenius norm of 
a scaled Gaussian matrix, which follows from an easy calculation. 

Proposition A.l. Fix matrices S,T, and draw a standard Gaussian matrix G. 
Then 

(e \\SGT* ^ ^\\S\\p\\T\\^. 

Proof. The distribution of a Gaussian matrix is invariant under orthogonal trans- 
formations, and the Frobenius norm is also unitarily invariant. As a result, it repre- 
sents no loss of generality to assume that S and T are diagonal. Therefore, 

Since the right-hand side is unitarily invariant, we have also identified the value of 
the expectation for general matrices S and T. □ 

The expected spectral norm of a scaled Gaussian matrix is also known. The result 
is due to Gordon [62, 63], who established the bound using a sharp version of Slepian's 
lemma. See [85, §3.3] and [37, §2.3] for additional discussion. 

PROPOSITION A. 2. Fix matrices S,T, and draw a standard Gaussian matrix G. 
Then 

E\\SGT*\\ < ||S||||Tl|p + ||5||p||T||. 
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Remark A.l. Proposition A. 2 provides a very accurate bound. When S and T 
are diagonal, the first term \\S\\ \\T\\-p is roughly the maximum expected £2 norm of a 
row of SGT* ; the second term is the maximum expected £2 norm of a column. Each 
of these quantities is a lower bound for the spectral norm. Furthermore, when S, T 
are identity matrices, the result is asymptotically sharp as the size of the matrix G 
tends to infinity with the ratio of its dimensions fixed. 

A. 2. Spectral norm of pseudoinverse. Now, we turn to the pseudoinverse 
of a Gaussian matrix. Recently, Chen and Dongarra developed a good bound on 
the probability that its spectral norm is large. The statement here follows from [28, 
Lem. 4.1] after an application of Stirling's approximation. See also [94, Lem. 2.14] 

Proposition A. 3. Let G be an m x n standard Gaussian matrix with n > m. 
For each t > I, 



'{11^1 >t} < 



y^27r(n - m + 1) 



n — m + 1 



n — m+l 



t 



-(n— m+1) 



We can use Proposition A. 3 to bound the expected spectral norm of a pseudo- 
inverted Gaussian matrix. 

Proposition A. 4. Let G be a m x n standard Gaussian matrix with n — m > 1. 
Then 

E||Gt|l ^ 



< 



n — m 



Proof. Let us make the abbreviations p = n — m and 



C 



C\/n 



p+l 



We compute the expectation by way of a standard argument. The integral formula 
for the mean of a nonnegative random variable implies that, for all E > 0, 

E\\G^\= P {\\G^\ > t} dt < E + P{||G^|| > t} di 

Jo J E 

poo -, 

<E + C dt = E+ -CE~P, 

Je P 

where the second inequality follows from Proposition A. 3. The right-hand side is 
minimized when E = C^/(p+^\ Substitute and simplify. □ 

A. 3. Probenius norm of pseudoinverse. The squared Frobenius norm of a 
pseudo-inverted Gaussian matrix is closely connected with the trace of an inverted 
Wishart matrix. This observation leads to an exact expression for the expectation. 

Proposition A. 5. Let G be an m xn standard Gaussian matrix with n — m > 2. 
Then 
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Proof. Observe that, almost surely, 



||Gt||p = trace [{G^yC^ = trace [{GG*)-^] 

The random matrix [GG*]^^ follows the inverted Wishart distribution. As a result, 
its expected trace can be computed directly [100, p. 97]. □ 

On the other hand, very little sees to be known about the tail behavior of the 
Frobenius norm of a pseudo-inverted Gaussian matrix. The following theorem, which 
is new, provides an adequate bound on the probability of a large deviation. 



Theorem A. 6. Let G be an m x n standard Gaussian matrix with n 
For each t > 1, 



m > 4. 



\G%> 



12m 



■t)' < 4t-^ 



— (n—m 



)/2 



Neither the precise form of the estimate nor the constants are ideal, as we explain 
in Remark A. 2. Our goal has been to develop a useful bound with minimal fuss. 

The rest of the section is devoted to the rather lengthy proof. Unfortunately, 
most of the standard methods for producing tail bounds fail for random variables 
that do not exhibit normal or exponential concentration. Our argument relies on 
special properties of Gaussian matrices and a dose of brute force. 

A. 3.1. Technical background. We begin with a piece of notation. For any 
number g > 1, we define the Lg norm of a random variable Z by 



E^Z) = (E\Z 



In particular, the Lq norm satisfies the triangle inequality. 

We continue with a collection of technical results. First, we present a striking 
fact about the structure of Gaussian matrices [54, §3.5]. 

Proposition A. 7. An m.xn standard Gaussian matrix is orthogonally equivalent 
with the bidiagonal matrix 



Ym-l 



Ym-2 



X. 



n-2 



(A.l) 



where, for each j, the random variables Xj and Y^ follow the distribution with j 
degrees of freedom. Furthermore, these variates are mutually independent. 

We also require the moments of a chi-square variate, which arc expressed in terms 
of special functions. 

Proposition A. 8. Let S be a variate with k degrees of freedom. When 
0<q< k/2, 

2^r(fc/2 + ,) ^ Tik/2 q) 

^ ' r(fc/2) ^ 29r(fc/2) 
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Proof. Recall that a variatc with k degrees of freedom has the probability 
density function 



fit) 



1 



2'^V2r(fc/2) 
By the integral formula for expectation 



^fc/2-l^-t/2 for t > 0. 



E(5«)= / t^f{t)dt^ 

JQ 



2'?r(fc/2 + g) 
T{k/2) ' 



where the second equality follows from Euler's integral expression for the gamma 
function. The other calculation is similar. □ 

To simplify the proof, we need to eliminate the gamma functions. The next result 
bounds the positive moments of a chi-square variate. 

Lemma A. 9. Let 'E. be a variate with k degrees of freedom. For q > 1, 

E'^iE) <k + q. 



Proof. Write q = r + 9, where r = [q\. Repeated application of the functional 
equation zT{z) = T{z + 1) yields 



r(fc/2) + 



E«(S) = 

The gamma function is logarithmically convex, so 
2'^r(fc/2 + 0) ^2'^ ■ T{k/2y-'' ■ T{k/2 + 1) 



1/9 



T{k/2) 



r(fc/2) 



= k'' < 



llik + 2{q-j)) 



e/r 



The second inequality holds because k is smaller than each term in the product, hence 
is smaller than their geometric mean. As a consequence, 



E«(S) < 



l[{k + 2{q-j)) 



1/r 



< i^(fc + 2(g-j))<fc 



j=i 



The second relation is the inequality between the geometric and arithmetic mean. □ 

Finally, we develop a bound for the negative moments of a chi-square variate. 

Lemma A. 10. Let E be a variate with k degrees of freedom, where fc > 5. 
When2<q< {k-l)/2, 
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Proof. Wc establish the bound for q ~ {k — l)/2. For smaller values of q, the 
result follows from Holder's inequality. Proposition A. 8 shows that 





rr(fc/2-g)l 


1/9 


r r(i/2) 1 




. 29r(fc/2) 




.2«r(fc/2)_ 



Stirling's approximation ensures that T{k/2) > \/2t: ■ (fc/2)^*' ^^/^ • e Since the 
value r(l/2) = 



1/9 



1/29 



<^' 



_2'^^/2^• (fc/2)« • e-9-i/2_ 

where we used the assumption g > 2 to complete the numerical estimate. □ 

A. 3. 2. Proof of Theorem A. 6. Let G be an m x n Gaussian matrix, where 
we assume that n — m > 4. Define the random variable 



G 



Our goal is to develop a tail bound for Z. The argument is inspired by work of 
Szarek [129, §6] for square Gaussian matrices. 

The first step is to find an explicit, tractable representation for the random vari- 
able. According to Proposition A. 7, a Gaussian matrix G is orthogonally equivalent 
with a bidiagonal matrix L of the form (A.l). Making an analogy with the inversion 
formula for a triangular matrix, we realize that the pseudoinverse of L is given by 



ri-1 



X. 



n-2 



-Yi 



X 



n— (m— 1) 



Since is orthogonally equivalent with G^ 



m-l / 

j=o \ 



r2 . 



where we have added an extra subdiagonal term (corresponding with = 0) so that 
we can avoid exceptional cases later. We abbreviate the summands as 



^n-j+1 



, j=0,l,2,...,m-l. 



Next, we develop a large deviation bound for each summand by computing a mo- 
ment and invoking Markov's inequality. For the exponent q = (n — m)/2, Lemmas A. 9 
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and A. 10 yield 



y2 . 



< E'(X-4.) [1 + &iY^_,) ■ E^X-^^^,)] 



< 



n- J 
3 

n- j 



3(m - j + q) 



1 + 3- 



n - j + 1 

3(rt — m + 1 ~ q) 



n - j + 1 



Note that the first two relations require the independence of the variates and the 
triangle inequality for the Lq norm. The maximum value of the bracket evidently 
occurs when j = 0, so 



Markov's inequality results in 



12 
n~ j' 



J =0,1,2,. ..,m-l. 



12 

n-j 



Select u = t ■ {n — j)/{n — rn) to reach 





n — 


m 










n — 


j . 



To complete the argument, we combine these estimates by means of the union 
bound and clean up the resulting mess. Since Z < X^JIq^ 

12to 



Z > 



3=0 



1 1 



n — J 



To control the sum on the right-hand side, observe that 



x^dx 

(n — raY , ^ „ i i 2(n — m) 

< ^ — 7^(" - "ir-'+i = ^ — ^ < 4, 

q — \ n ~ m ^ 2 

where the last inequality follows from the hypothesis n — m > 4. Together, the 
estimates in this paragraph produce the advertised bound. 

Remark A. 2. We have presented the simplest argument we know that produces a 
reasonable result. With a sufficient investment of care, it is possible to obtain sharper 
estimates. For example, one can establish that the tail bound actually decays like 
^-(n-m+i)/2 j^y controlling the summands Wj more delicately. 

One can also show that the correct scale for large deviations is (n — m)~^ , rather 
than m{n — m)^^. This argument depends on the observation that the even-indexed 



,t^L"-jJ Jo 
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summands Wq, W2, W4, . . . are mutually independent. To analyze the sum of the even 
terms, we can use truncation and Bernstein's inequality. This approach also requires 
a detailed estimate of the variance of the truncated sum. The odd terms receive a 
parallel treatment. Unfortunately, this proof requires some fortitude. 

It would be very interesting to find a direct argument that yields concentration 
results for inverse spectral functions of a Gaussian matrix. 

Appendix B. Subsampled random Fourier transforms. This section demon- 
strates that a structured dimension reduction map, caUed the SRFT, preserves the 
geometry of an entire subspace of vectors. These results have strong precedents in [116, 
Thm. 3.1] and [132, §9]. See also [103]. 

B.l. Technical background. We begin with some notation. For any q > I, 
we write 



for the Lq norm of the random variable Z. We also use a special notation for the 
expectation with respect to one random variable, holding others fixed. 



A Rademacher random variable takes the values ±1 with equal probability. We 
reserve the letter ^ for a Rademacher variable, and we often write ^ for a vector 
whose entries are independent Rademacher variables. A Steinhaus random variable 
is uniformly distributed on the complex unit circle. We reserve e and e for Steinhaus 
random variables and vectors. 

We need to distinguish two different models for a random subset T of the indices 
{1, 2, . . . , n}. Let £ be an integer in the range [0, n]. 

Dependent Model Let Si,.. .,6,1 be 0-1 random variables that satisfy ~ 

with all such choices equally likely. Observe that the random variables have 
common mean S = £/n, but they are not independent. 

Independent Model Let . . . , i5„ be independent 0-1 random variables with com- 
mon mean d ^ £/n. 

In each case, we define a random set T = { j : Sj = 1}. In the dependent model, we 
interpret T as a uniformly random subset of {1, 2, . . . , n} with cardinality £. In the 
independent model, T is a random subset with expected cardinality £. 



Given a subset T of indices in {1,2, ... ,n}, we define the restriction operator 
Rt ■■ C" ^ C"^ via the rule 



When T is a random subset, Rt is a restriction to a random set of coordinates. 

In this appendix, we find it more convenient to work with the conjugate transpose 
of the SRFT matrix that we work with in the body of the paper. We use a different 
letter to preserve the distinction, so it should present no confusion that we also call 
this transform an SRFT. For I < £ < n, wc define the £ x n matrix 



Exf{X,Y)=E[f{X,Y) I Y 



{Rtx){])^Xj, jeT. 
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where 

• D = diag(e) is an n x n diagonal matrix with Stcinhaus entries; 

• is the n X n unitary DFT matrix; and 

• T is a random set of £ coordinates, drawn from the dependent model. 

Alternatively, we can construct an SRFT by changing the distribution of the subset 
T to the independent model. An independent SRFT has £ rows only in expectation. 
For this conceptual reason, we have adopted the dependent model in the body of 
the paper. On the other hand, the technical arguments are better adapted to the 
independent model. 

The following basic result allows us to transfer conclusions from the independent 
model back to the dependent model. We learned this argument from [25, Sec. 3]. In 
the statement, we say that a real-valued function / on sets is monotonic ii S G S' 
implies f{S) < 

Proposition B.l (Decoupling). Suppose f is a monotonic function on subsets 
of {1, 2, . . . , n}. Let S be a random set of expected cardinality £ drawn from the inde- 
pendent model, and let T be a random set of cardinality £ drawn from the dependent 
model. Then 

P {f{T) < w} < 2P < u} for allueR. 
Proof. Using the law of total probability, we compute that 

p{/(^) <u} = J2'j=o^ifi^) " 1 1^1 = ^} ■ IP { 1^1 - j} 

>E' J{fiS)<u \ \S\=j}-F{\S\^j} 
>F{f{S)<u \ 1^1 = J{\S\^j} 

'^3=0 

>F{f{T)<u}-l^. 

The second inequality requires the monotonicity of /, and the third depends on the 
fact [74, Thm. 3.2] that the medians of the binomial(^, n) distribution lie between 
£ - 1 and ^. □ 

We also need a tail bound for a function of Steinhaus variates. This result follows 
from an argument that Ledoux developed for bounded, real random variables [83, 
Cor. 1.3 et seq.]. See also [84, §5.2] for a discussion of concentration in product 
spaces. 

Proposition B.2 (Steinhaus tail bound). Suppose h is a convex function on 
(complex) vectors that satisfies the Lipschitz bound 

\h{x) - h{y)\ < L \\x - y\\ forallx,y. 

Let e be a Steinhaus vector. For all t > Q, 



V{h{e) > Eh{e) + Lt} < 0"*"/^ 
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The heavy Ufting is accomplished by means of the following result, due to Rudel- 
son [114]. which depends on the noncommutative Khintchine inequality [19,91]. 

Proposition B.3 (Rudelson's Lemma). Let {aj} d C'^ be a sequence of vectors, 
and let {cj} be a sequence of complex scalars. For each integer q> 1, 



1^ . S.jCjaja* < (A:V2)^/^«e"°-^y2g • maxj \\aj\ 
where ^ is a Rademacher vector. 



IE, 



1/2 



The version here is extracted from the discussion in [132, Sec. 9], along with the 
fact that the optimal constant T)2q in the noncommutative Khintchine inequality for 
Rademacher series [20] verifies 



D 



2q 



[(2g- l)!!]'/'"^ 



(2g)! 



2iq\ 



l/2q 



< 2i/49e-0-5y2^, 



where the inequality follows from Stirling's approximation. 

B.2. The SRFT preserves geometry. The SRFT preserves the geometry of 
an entire subspace. For simplicity, we focus on showing that the operator never maps 
a vector in the subspace to zero. The same argument also shows that the SRFT does 
not inflate any vector in the subspace. We establish the following result. 

Theorem B.4 (The SRFT preserves rank). Let V be an n x k orthonormal 
matrix, where k > 246. Select a parameter £ that satisfies 



24 



Vk + V81og(25n)j log(4fc) < £ < n. 



Draw a dependent or independent i x n SRFT matrix Then 



with probability exceeding 1/2. 



The constants in Theorem B.4 are somewhat larger than we might like because 
we wanted to ensure that the statement is effective for moderate values of k and 
n. We also have the following result of a more asymptotic flavor that may be more 
attractive. 

Theorem B.5 (SRFT: Large sample bounds). Let V be an n x k orthonormal 
matrix, where /c 3> 1 and k 3> log(7i). // the number i of samples satisfies 

9fclog(4fc) <e<n, 

then a dependent or independent £ x n SRFT $ satisfies 

1 



18 



except with probability 0(fc 



-l/36\ 
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B.3. Overview of Theorem B.4. We present the main line of reasoning here, 
postponing the proofs of the lemmata to the sequel. The first step is to show that the 
matrix FD equilibrates row norms. 

Lemma B.6 (Row norms). Let V be an nx k orthonormal matrix. Then FDV 
is an n X k orthonormal matrix, and 



II *.^^.rMi k /81og(/3n) 1 
^maxJ|e;(FW)||<^- + ^^P <- 



for each /3 > 1. 



When k 3> logn, the second term in the bound is negligible, in which case the 
row norms are essentially as small as possible. On the other hand, when k < logn, 
small sample effects can make some row norms large. 

The next result states that randomly sampling rows from an orthonormal matrix 
results in a full-rank matrix. The minimum size of the sample depends primarily on 
the row norms, and the sampling procedure is most efficient when the orthonormal 
matrix has uniformly small rows. 

Lemma B.7 (Row sampling: Independent model). Let W be an nx k orthonor- 

II 1 1 2 

mal matrix, and define r ^ n ■ maxj=i....^„ ||e* W^|| . Fix rj G (0, 0.5), and select 

8rlog(4fc) 
- 1 - 2r; ■ 

Draw a random set T of expected cardinality I from the independent model. Then 

cTkiRrW) >J^ 
V n 

except with probability (1 - 77/(4 - 477))2i°g('=) < fc^''/^. 

Select W = FDV. Lemma B.6 with /? = 25 establishes that W is an orthonor- 
mal matrix whose largest row norm is essentially as small as possible, so that 



n ■ max ||e*W^|P < 

j=l,....n 



Vfc+ v/81og(25n) , 

except with probability 0.04. Next, we apply Lemma B.7 with ij ~ 1/3. For 

e > 24rlog(4fc), 



we have akiRrW) > \JTJZn except with probability 



-|21og(fe) /^^21og(fc) 



4(1-??) 



< 0.46, 



provided that k > 19. Since = ^/n/£ ■ RtW, we have established Theorem B.4 
for the independent model. 

For the dependent model, we replace Lemma B.7 with the following corollary. 
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Corollary B.8 (Row sampling: Dependent model). Let T be a random sub- 
set of exact cardinality i drawn from the dependent model. Then the conclusion of 
Lemma B. 7 holds except that the failure probability may double. 

Proof. The set function S ^ <T}^{Rg'W) is monotonic because of the interlacing 
theorem for singular values [72, Thm. 7.3.9]. The claim follows by Proposition B.l on 
decoupling. □ 

In the dependent case, we need to assume that k > 246 to obtain a total failure 
probability of at most 0.5. This point completes the proof of Theorem B.4. 

The same type of argument implies Theorem B.5. In this case, we take (3 — k 
and rj = 1/18 to obtain the result. We omit the details. 

B.4. Proofs of supporting lemmas. It remains to check that the underlying 
results are true. We begin with the claim that FE balances row norms. 

Proof. [Lemma B.6] Observe that the orthonormality condition on V is equivalent 
with V*V = Ik. Therefore, = 1 and ||y|lp = Vk. To check that FDV is also 
an orthonormal matrix, simply compute that 

{FDVy{FDV) = V*V = I 

because D, F are unitary. 

Recall that D = diag(£) for a Steinhaus vector e, so we may use the Steinhaus 
tail bound to study the deviation of the row norms of FDV. Fix a row index 
J € {1, 2, . . . , n}, and define the random variable 

h(e) = \\e*FDV\\ ^ ||e*Fdiag(£)F|| = \\£*EV\\ , 

where E = diag(e*i^), the diagonal matrix constructed from the jth row of F. It 
is easy to see that ft, is a convex function, and we quickly determine its Lipschitz 
constant: 

\h{x) - h{y)\ <\\{x- yYEVW < \\x - y\\ \\E\\ \\V\\ = ^\\x- y\\ . 

The equality holds because each entry of the diagonal matrix E has magnitude 
and V has unit spectral norm. It is also straightforward to bound the expectation of 
the random variable: 

Eh{e) < [E/i(£)2]'/' = ||W||f < \\E\\ \\V\ 

since Y has Frobenius norm ^/k. Apply the Steinhaus tail bound. Proposition B.2, 
with t = ^81og(/3n) to reach 

p|||e:(FW)|| > /81og(/?»)| < 3-8iog(,„)/8 ^ J_ 

y J II V n V n I l3n 

This estimate holds for each row index j = 1,2, ... ,n. Finally, take a union bound 
over these n events to reach the advertised conclusion. □ 

Proof. [Lemma B.7] We can control the fcth singular value of the matrix RtW 
by bounding the smallest eigenvalue of the k x k Gram matrix 




X = {RrWyiRTW). 
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The first task is to rewrite this Gram matrix. Let Wj denote the jth row of W , and 
abbreviate M = max^^i^ „ ll'^^jll- Since W is an orthonormal matrix, WjW* — I. 
Recall that the random set T ~ {j : 5j = I}, where i5i, . . . ,5„ arc independent 0-1 
random variables with common mean 5 = £/n. It follows that 



En 



This expression also makes clear that EX = 61. 

Let us define a random variable that measures the spectral-norm deviation of X 
from its mean: 



Z 



\X-6I\\ 



We will develop a condition on the parameter £ under which Z < {1 — ri)S with high 
probability. In that event, the least eigenvalue Xk of X satisfies |Afc — (5| < (1 — ri)6, 
which implies that Xk > rjS. It follows that 

akiRrW) = v^> v^ = 



which is the conclusion of the lemma. 

The major challenge is to bound an appropriate moment of the random variable, 
so we set 



E = E^'^iZ) = E^" . SjWjW* -SI 



E 



2q 



where <? > 1, the precise value to be decided later. To begin the estimate, we inject 
additional randomness via the symmetrization procedure. Let {6^} be an independent 
copy of the selector variables. By Jensen's inequality, 

E = E^" 1^ {Sj - E5^)wjw* < E^" W'^ iSj - 5'j)wjw* 

Since {Sj — Sj} is an independent family of symmetric variables, the expectation does 
not change if, for each j, we multiply the jth summand by an independent Rademacher 
variable and average. 



E < 



Eg Es,S' 1^^. - dj)wjW* 
Es,s' E^ ^ . ^j{Sj - Sj)wjW* 



2q 



2q 



1 1/2? 



l/2q 



where we use independence of ^ from the other variables to justify the interchange of 
the expectations. See [85, Lem. 6.3] for more discussion on symmetrization. 

Rudelson's lemma, conditional on the choice of the selectors d and S' , yields 



E< (fcV2)i/2«e-°-5V^-M-: 



f2q 



lE,i^.-^;i 



1/2 



Jensen's inequality permits us to move the moment inside the square root: 



E < (fc\/2)i/29e-0-5y2^ . M ■ [E^" |^ . \S, 



-3 ^']\ 



1/2 
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Let us make some bounds on the large parenthesis. Each matrix WjW* is psd, so the 
norm increases monotonically with each scalar coefficient. Thus, we can introduce the 
inequality \5j — <5jP < Sj + Sj and invoke the triangle inequality (twice) to obtain 



where the last identity follows from the identical distribution of the sequences. Next, 
we add and subtract the scalar matrix i5I inside the norm, apply the triangle inequal- 
ity, and then identify a copy of E: 

E^' l^.^jiojio* <E^''j^'^_SjWjW*-5I +\\SI\\ = E + S. 

Combining these observations, we discover 

E < (fcV2)i/2«e-°-^v^ ■M-{E + 

To minimize the constant, select q ~ \\og{k^/2)~\ , which is roughly optimal. In sum- 
mary. 



E < v/4^M2 • (E + S)^/^. 
To study this relation, we make the change of variables F — S^^E so that 



2 / 4qAf2 



F^ < 



(F + l). 



A rather tedious (but elementary) computation shows that solutions to i^^ < a{F+l) 
satisfy the rule 



a < -(1-27?) 



F<l--rj. 



Therefore, 



AqAP 1 , 



F <l - -n 
4 ' 



E<{l^-rj]S. 



Recalling the definition of E along with the facts S = £/n and q < log(4fc), we see 
that 



i> 



8{APn) log(4fc) 
1 - 2t] 



In words, a lower bound on £ delivers an upper bound for a carefully chosen moment 
of the random variable Z. 

Given this information, we can produce a tail bound for Z with minimal effort 
by invoking Markov's inequality. 



'{Z> (1-77)5} < 



E2'?(Z) 

(T^ 



2q 
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Introduce the estimate for the moment, and use the lower bound q > \ogk to reach 



'{Z > (1 - r/)(5} < 



1 



5^/4 \ 



2q 



1-77 J 

We can further bound the right-hand side as 



< 



1 - 



4(1-7?) 



21og(fc) 



1 



1] 



4(1 - 7/) 



21og(fc) 



< (1 - 77/4)2 = fc-21og(l-V4) < 



This point completes the argument. □ 

Remark B. 1. In the proof of Lemma B.7, we used Markov's inequality to bound 
the probability that Z exhibits a large deviation above its mean. In the present 
setting, this approach produces an imprecise conclusion. A more careful argument, 
reminiscent of [115, Thm. 3.9], shows that large deviations are even less likely. 

Remark B.2. The logarithmic factor in these results is necessary^ as we show 
by example. Fix an integer fc, and let n = k^. Form an n x k orthonormal matrix W 
by vertically stacking k copies of the k x k unitary DFT, scaled so that \\W\\ = 1. 
Suppose that we sample random rows from W to produce a matrix X. To ensure 
that crfc(X) > 0, we must pick at least one copy each of the k distinct rows. This is 
a coupon collection problem [56]. It is well known that, to obtain a complete set of k 
coupons (i.e., rows) with nonnegligible probability, about klogk draws are required. 
We accrue essentially no advantage by sampling without replacement because the 
matrix has too many rows. 

It is indeed possible for a matrix of the form W = FDV to present this unde- 
sirable property. Consider the matrix V formed by regular decimation of the n x n 
identity matrix. More precisely, let V be the nx k matrix whose jth row is fc^^/^e*^^, 
when j = (mod k) and zero otherwise. 
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